Cover Design

The images on the front and back covers 
of this issue are different visualizations 
of the same data output from a regional 
climate simulation program run by Dr. 
John Roads of the Scripps Institution of 
Oceanography. The data depicted con- 
tain measures of temperature, liquid and 
gaseous water content, and wind vectors; 
the topography represented by the data 
is the western U.S. in January 1990. Pro- 
viding earth scientists with the ability to 
visualize such data is one of the objectives 
of the Sequoia 2000 research project — 
a joint effort of the University of California, 
government agencies, and industry to build 
a computing environment for global change 
research. This issue presents papers on sev- 
eral major areas explored by Sequoia 2000 
researchers, including an electronic reposi- 
tory, networking, and visualization. 

The cover was designed by Lucinda O'Neill 
of Digital's Design Group. Special thanks go 
to Peter Kochevar for supplying the cover 
images. 
Editor's
Introduction 



Scientists have long been motivators 
for the development of powerful
computing environments. Two 
sections in this issue of the Journal 
address the requirements of scientific 
and technical computing. The first, 
from Digital's High Performance 
Technical Computing Group, looks 
at compiler and development tools 
that accelerate performance in parallel 
environments. The second section 
looks to the future of computing; 
University of California and Digital 
researchers present their work on a 
large, distributed computing environ- 
ment suited to the needs of earth sci- 
entists studying global changes such 
as ocean dynamics, global warming, 
and ozone depletion. Digital was an 
early industry sponsor and participant 
in this joint research project, called 
Sequoia 2000. 

To support the writing of parallel 
programs for computationally intense 
environments, Digital has extended 
DEC Fortran 90 by implementing 
most of High Performance Fortran 
(HPF) version 1.1. After reviewing
the syntactic features of Fortran 90 
and HPF, Jonathan Harris et al. focus 
on the HPF compiler design and 
explain the optimizations it performs 
to improve interprocessor communi- 
cation in a distributed-memory envi- 
ronment, specifically, in workstation 
clusters (farms) based on Digital's 
64-bit Alpha microprocessors. 

The run-time support for this dis- 
tributed environment is the Parallel 
Software Environment (PSE). Ed 
Benson, David LaFrance-Linden,
Rich Warren, and Santa Wiryaman 
describe the PSE product, which is 
layered on the UNIX operating system
and includes tools for developing
parallel applications on clusters of up
to 256 machines. They also examine 
design decisions relative to message- 
passing support in distributed systems 
and shared-memory systems; PSE 
supports network message passing, 
using TCP/IP or UDP/IP protocols, 
and shared memory. 

Michael Stonebraker's paper opens 
the section featuring Sequoia 2000 
research and is an overview of the 
project's objectives and status. The 
objectives encompassed support for 
high-performance I/O on terabyte 
data sets, placing all data in a DBMS, 
and providing new visualization tools 
and high-speed networking. After 
a discussion of the architectural layers, 
he reviews some lessons learned by 
participants — chief of which was to 
view the system as an end-to-end 
solution — and concludes with a look 
at future work. 

An efficient means for locating 
and retrieving data from the vast 
stores in the Sequoia DBMS was 
the task addressed by the Sequoia 
2000 Electronic Repository project 
team. Ray Larson, Chris Plaunt, 
Allison Woodruff, and Marti Hearst 
describe the Lassen text indexing 
and retrieval methods developed 
for the POSTGRES database system, 
the GIPSY system for automatic index- 
ing of texts using geographic coor- 
dinates discussed in the text, and the 
TextTiling method for automatic 
partitioning of text documents to 
enhance retrieval. 

The need for tools to browse 
through and to visualize Sequoia
2000 data was the impetus behind
Tecate, a software platform on which
browsing and visualization applications
can be built. Peter Kochevar
and Len Wanger present the features
and functions of this research proto- 
type and offer details of the object 
model and the role of the interpre- 
tive Abstract Visualization Language 
(AVL) for programming. They conclude
with example applications that
browse data spaces. 

The challenge of high-speed net- 
working for Sequoia 2000 is the sub- 
ject of the paper by Joseph Pasquale, 
Eric Anderson, Kevin Fall, and Jon
Kay. In designing a distributed system 
that efficiently retrieves, stores, and 
transfers very large objects (in excess
of tens or hundreds of megabytes), 
they focused on operating system 
I/O and network software. They 
describe two I/O system software 
solutions — container shipping and 
peer-to-peer I/O — that avoid data 
copying. Their TCP/IP network 
software solutions center on avoiding
or reducing checksum computation. 

The editors thank Jean Bonney, 
Digital's Director of External 
Research, for her help in obtaining 
the papers on Sequoia 2000 research 
and for writing the Foreword to this 
issue. 

Our next issue will feature papers 
on multimedia and UNIX clusters. 

Jane C. Blake 
Managing Editor

Digital Technical Journal

Vol. 7 No. 3 1995



Foreword 




Jean C. Bonney 

Director, External Research

The Information Utility, the 
Information Highway, the Internet, 
the Infobahn, the Information 
Economy: the sound bytes of the
1990s. To make these concepts 
reality, a robust technology infra- 
structure is necessary. In 1990, 
Digital's research organization saw 
this need and set out to develop an 
experimental test bed that would 
examine assumptions and provide a 
basis for a technology edge in the '90s. 
The resulting project was Sequoia 
2000, a three-year research collabora- 
tion between Digital, campuses of the 
University of California, and several 
other industry and government orga- 
nizations. The Sequoia 2000 vision is 

Petabytes (i.e., quadrillions of bytes)
of data in a distributed archive, 
transparently managed, and 
logically viewed over a high-speed
network with isochronous capabilities 
via a host of tools 

— in other words, a big, fast, easy-to- 
use system. 

Although the vision is still not reality 
today, our more than three years 
of participation in Sequoia 2000 
research gave us the knowledge base 
we sought. 



After a rigorous process of pro- 
posal development and review by 
experts at Digital and the University 
of California, Sequoia 2000 began 
in June 1991. The focus of the
research was a high-speed, broad- 
band network spanning University 
of California campuses from Berkeley 
to Santa Barbara, Los Angeles, and 
San Diego; a massive database; storage;
a visualization system; and electronic
collaboration. Driving the
research requirements were earth 
scientists. The computing needs of 
these scientists push the state of the 
art. Current computing technologies 
lack the capabilities earth scientists 
need to assimilate and interpret the 
vast quantities of information col- 
lected from satellites. Once the data 
are collected and organized, there is 
the challenge of massive simulations, 
simulations that forecast world climate 
ten or even one hundred years from 
now. These were exactly the kinds 
of challenges the computer scientists 
needed.

Among the major results of three 
years of work on Sequoia 2000 was 
a set of product requirements for 
large data applications. These require- 
ments have been validated through 
discussions with customers in finan- 
cial, healthcare, and communications 
industries and in government. The 
requirements include 

■ A computing environment built 
on an object relational database, 
i.e., a data-centric computing 
system 

■ A database that handles a wide 
variety of nontraditional objects
such as text, audio, video, graph- 
ics, and images 



■ Support for a variety of traditional 
databases and file systems 

■ The ability to perform necessary 
operations from computing 
environments that are intuitive 
and have the same look and feel;
the interface to the environment 
should be generic, very high level, 
and easily tailored to the user 
application 

■ High-speed data migration 
between secondary and tertiary 
storage with the ability to handle 
very large data transfers 

■ Network bandwidth capable 
of handling image transmission 
across networks in an acceptable 
time frame with quality guarantees 
for the data 

■ High-quality remote visualization 
of any relevant data regardless
of format; the user must be able
to manipulate the visual data 
interactively 

■ Reliable, guaranteed delivery
of data from tertiary storage to 
the desktop 

Sequoia 2000 was also a catalyst 
for maturing the POSTGRES research 
database software to the point where 
it was ready for commercialization. 
The commercial version, Illustra,
is available on Alpha platforms and 
is enjoying success in the banking 
industry and in geographic information
system (GIS) applications, as
well as in other government applica- 
tions with massive data requirements. 
Illustra is also making inroads into the 
Internet where it is used by on-line 
services. 

Yet another major result of Sequoia 
2000 was a grant from the National 
Aeronautics and Space Administration
(NASA) to develop an alternate
architecture for the Earth Observing
System Data and Information System
(EOSDIS). EOSDIS will process the
petabytes of real-time data from
the Earth Observing System (EOS)
satellites to be launched at the end
of the decade. The alternate information
architecture proposed by the
University of California faculty was
the Sequoia 2000 architecture. It 
will have a major influence on the 
EOSDIS project. 

For the earth scientists, gains 
were made in simulation speeds and 
in access to large stores of organized 
data. These scientists used some of 
Digital's first Alpha workstation farms 
and software prototypes for their cli- 
mate simulations. An eight-processor 
Alpha workstation farm provided a 
two-to-one price/performance advan- 
tage over the powerful, multimillion- 
dollar CRAY C90 machine. In another 
earth science application, scientists 
using Alpha and hierarchical storage 
systems could simulate two years' 
worth of climate data over the week- 
end without operator intervention; 
formerly, two months' worth of data 
took one day to simulate and required
considerable operator intervention. 
Thus many more simulations could 
be processed in a fixed time and 
"time to discovery" was decreased 
considerably. 

Now that we can look at Sequoia 
2000 in retrospect, would we do 
such a project again? The answer 
is a resounding "yes" from all of
us involved. It was a complex proj- 
ect that included 12 University of 
California faculty members, 25 graduate
students, and 20 staff. Another
8 faculty members and students provided
additional expertise. Four of
Digital's engineers worked on site, 
and a variety of support personnel 
from other industry sponsors partici- 
pated, including SAIC, the California 
Department of Water Resources, 
Hewlett-Packard, Metrum, United 
States Geological Survey (USGS), 
Hughes Application Information 
Services, and the Army Corps of
Engineers. 

But as is the case with such ambitious
projects, there were unanticipated
and difficult lessons for all
to learn. To experiment with real-life
test beds means considerably
more than writing a rigorous set 
of hypotheses in a proposal. Michael 
Stonebraker, in his paper, notes a 
number of challenges we faced and 
the lessons learned. One of the issues 
that kept surfacing was the "grease 
and glue" for the infrastructure, that 
is, the interoperability of pieces of 
software and hardware that composed 
the end-to-end system. This remains 
a challenge that needs research if we 
are going to achieve the promised
goals of internetworking. Another 
sticky point was scalability. On the 
one hand, it is difficult to build a very 
large networked system from scratch. 
On the other hand, as we slowly built 
the mass storage system to the point 
of minimal critical mass, we found 
that the current off-the-shelf tech- 
nologies for mass storage were not 
ready to be put to use for our purposes.
So, yes, we believe the project was
worthwhile, with some caveats. We
gained critical knowledge about the
technology, and we also came a long
way in learning the art of directing
and leading the type of project that is
necessary to assist the Information
Technology industry in its quest 
for the ubiquitous distributed 
information system. 

How else are we going to get 
insight into the critical issues of build- 
ing and reliably operating a robust 
information infrastructure without 
building a large test bed with real end 
users whose needs push the state of 
the art at each point along the way? 
We believe that large projects similar 
to Sequoia are crucial. The papers 
that follow attest to the important 
knowledge gained. We have focused
specifically on the end-to-end system
from the scientists' desktops to the
mass storage system, the challenge 
of building and using a large data 
repository, the timely and fast movement
of very large objects over the
network, and browsing and visualiz- 
ing data from networked sources. 
for Distributed-memory Systems



Jonathan Harris 
John A. Bircsak 
M. Regina Bolduc 
Jill Ann Diewald 
Israel Gale 
Neil W. Johnson 
Shin Lee 

C. Alexander Nelson 
Carl D. Offner 



Digital's DEC Fortran 90 compiler implements 
most of High Performance Fortran version 1.1, 
a language for writing parallel programs. The 
compiler generates code for distributed-memory 
machines consisting of interconnected work- 
stations or servers powered by Digital's Alpha 
microprocessors. The DEC Fortran 90 compiler 
efficiently implements the features of Fortran 90 
and HPF that support parallelism. HPF programs
compiled with Digital's compiler yield perfor- 
mance that scales linearly or even superlinearly 
on significant applications on both distributed- 
memory and shared-memory architectures. 



High Performance Fortran (HPF) is a new program- 
ming language for writing parallel programs. It is 
based on the Fortran 90 language, with extensions 
that enable the programmer to specify how array oper- 
ations can be divided among multiple processors for 
increased performance. In HPF, the program specifies 
only the pattern in which the data is divided among 
the processors; the compiler automates the low-level 
details of synchronization and communication of data 
between processors. 

Digital's DEC Fortran 90 compiler is the first imple- 
mentation of the full HPF version 1.1 language 
(except for transcriptive argument passing, dynamic 
remapping, and nested FORALL and WHERE con- 
structs). The compiler was designed for a distributed- 
memory machine made up of a cluster (or farm) of 
workstations and/or servers powered by Digital's 
Alpha microprocessors. 

In a distributed-memory machine, communication 
between processors must be kept to an absolute mini- 
mum, because communication across the network is 
enormously more time-consuming than any operation 
done locally. Digital's DEC Fortran 90 compiler 
includes a number of optimizations to minimize the 
cost of communication between processors. 

This paper briefly reviews the features of Fortran 90 
and HPF that support parallelism, describes how the 
compiler implements these features efficiently, and 
concludes with some recent performance results 
showing that HPF programs compiled with Digital's 
compiler yield performance that scales linearly or even 
superlinearly on significant applications on both 
distributed-memory and shared-memory architectures. 

Historical Background 

The desire to write parallel programs dates back to the 
1950s, at least, and probably earlier. The mathematician
John von Neumann, credited with the invention of the 
basic architecture of today's serial computers, also 
invented cellular automata, the precursor of today's 
massively parallel machines. The continuing motiva- 
tion for parallelism is provided by the need to solve 
computationally intense problems in a reasonable time 
and at an affordable price. Today's parallel machines, 
which range from collections of workstations connected
by standard fiber-optic networks to tightly coupled
CPUs with custom high-speed interconnection
networks, are cheaper than single-processor systems
with equivalent performance. In many cases, equivalent
single-processor systems do not exist and could
not be constructed with existing technology. 

Historically, one of the difficulties with parallel 
machines has been writing parallel programs. The work 
of parallelizing a program was far from the original sci- 
ence being explored; it required programmers to keep 
track of a great deal of information unrelated to the 
actual computations; and it was done using ad hoc 
methods that were not portable to other machines. 

The experience gained from this work, however, led 
to a consensus on a better way to write portable 
Fortran programs that would perform well on a variety
of parallel machines. The High Performance Fortran 
Forum, an international consortium of more than 
100 commercial parallel machine users, academics, 
and computer vendors, captured and refined these 
ideas, producing the language now known as High 
Performance Fortran. 1 3 HPF programming systems 
are now being developed by most vendors of parallel 
machines and software. HPF is included as part of the
DEC Fortran 90 language. 1 

One obvious and reasonable question is: Why 
invent a new language rather than have compilers 
automatically generate parallel code? The answer is
straightforward: it is generally conceded that auto- 
matic parallelization technology is not yet sufficiently 
advanced. Although parallelization for particular archi- 
tectures (e.g., vector machines and shared-memory 
multiprocessors) has been successful, it is not fully
automatic but requires substantial assistance from the 
programmer to obtain good performance. That assis- 
tance usually comes in the form of hints to the compiler 
and rewritten sections of code that are more parallelizable.
These hints, and in some cases the rewritten code,
are not usually portable to other architectures or com- 
pilers. Agreement was widespread at the HPF Forum 
that a set of hints could be standardized and done in a 
portable way. Automatic parallelization technology is 
an active field of research; consequently, it is expected
that compilers will become increasingly adept. 13 Thus,
these hints are cast as comments — called compiler 
directives — in the source code. HPF actually contains 
very little new language beyond this; it consists primar- 
ily of these compiler directives. 

The HPF language was shaped by certain key
considerations in parallel programming: 

■ The need to identify computations that can be 
done in parallel 

■ The need to minimize communication between 
processors on machines with nonuniform memory 
access costs 



■ The need to keep processors as busy as possible by 
balancing the computation load across processors 

It is not always obvious which computations in 
a Fortran program are parallelizable. Although some 
DO loops express parallelizable computations, other 
DO loops express computations in which later itera- 
tions of the loop require the results of earlier itera- 
tions. This forces the computation to be done in order 
(serially), rather than simultaneously (in parallel). 
Also, whether or not a computation is parallelizable 
sometimes depends on user data that may vary from 
run to run of the program. Accordingly, HPF contains 
a new statement (FORALL) for describing parallel
computations, and a new directive (INDEPENDENT)
to identify additional parallel computations to the
compiler. These features are equally useful for distrib- 
uted- or shared-memory machines. 
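
The distinction between the two kinds of loop can be illustrated outside Fortran. The following is a minimal Python sketch, not part of HPF or of Digital's compiler, and the function names are hypothetical. It treats a loop as a candidate for parallel execution if running its iterations in reverse order gives the same result. This is a quick check rather than a proof of independence, but it is enough to separate an element-wise loop from a recurrence.

```python
# Illustrative sketch only: models the loop-order argument in plain Python.
# The function names below are hypothetical and are not HPF features.

def scale_loop(a, order):
    """Each iteration reads a[i] and writes b[i]; iterations are independent."""
    b = [0.0] * len(a)
    for i in order:
        b[i] = 2.0 * a[i]
    return b

def recurrence_loop(a, order):
    """Iteration i reads the value written by iteration i-1 (a loop-carried dependence)."""
    a = list(a)  # work on a copy so the caller's data is untouched
    for i in order:
        if i > 0:
            a[i] = a[i - 1] + a[i]
    return a

def order_independent(loop, data):
    """True if forward and reverse iteration orders give the same result."""
    forward = list(range(len(data)))
    backward = list(reversed(forward))
    return loop(data, forward) == loop(data, backward)

data = [1.0, 2.0, 3.0, 4.0]
independent_ok = order_independent(scale_loop, data)       # element-wise loop
recurrence_ok = order_independent(recurrence_loop, data)   # running sum
print(independent_ok, recurrence_ok)
```

The element-wise loop passes the check and the running sum does not, which is the situation the text describes: the first kind of loop could be written as a FORALL, while the second must run serially.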

HPF's data distribution directives are particularly 
important for distributed-memory machines. The
HPF directives were designed primarily to increase 
performance on "computers with nonuniform memory
access costs." Of all parallel architectures, distributed
memory is the architecture in which the location
of data has the greatest effect on access cost. On 
distributed-memory machines, interprocessor communication
is very expensive compared to the cost of
fetching local data, typically by several orders of mag- 
nitude. Thus the effect of suboptimal distribution of 
data across processors can be catastrophic. HPF direc- 
tives tell the compiler how to distribute data across 
processors; based on knowledge of the algorithm, pro- 
grammers choose directives that will minimize com- 
munication time. These directives can also help 
achieve good load balance: by spreading data appro- 
priately across processors, the computations on those 
data will also be spread across processors. 

Finally, a number of idioms that are important in
parallel programming either are awkward to express in 
Fortran or are greatly dependent on machine architec- 
ture for their efficient implementation. To be useful in 
a portable language, these idioms must be easy to 
express and implement efficiently. HPF has captured 
some of these idioms as library routines for efficient 
implementation on very different architectures. 

For example, consider the Fortran 77 program in 
Figure 1, which repeatedly replaces each element of 
a two-dimensional array with the average of its north, 
south, east, and west neighbors. This kind of compu- 
tation arises in a number of programs, including itera- 
tive solvers for partial differential equations and 
image-filtering applications. Figure 2 shows how this 
code can be expressed in HPF. 

On a machine with four processors, a single HPF
directive causes the array A to be distributed across
the processors as shown in Figure 3. The program
executes in parallel on the four processors, with each
processor performing the updates to the array elements
it owns. This update, however, requires interprocessor
communication (or "data motion"). To
compute a new value for A(8, 2), which lives on
processor 0, the value of A(9, 2), which lives on
processor 1, is needed. In fact, processor 0 requires the
seven values A(9, 2), A(9, 3), ..., A(9, 8) from processor 1,
and the seven values A(2, 9), A(3, 9), ..., A(8, 9)
from processor 2. Each processor, then, needs seven
values apiece from two neighbors. By knowing the layout
of the data and the computation being performed,
the compiler can automatically generate the interprocessor
communication instructions needed to execute
the code.

Even for seemingly simple cases, the communication
instructions can be complex. Figure 4 shows the
communication instructions that are generated for the
code that implements the FORALL statement for a
distributed-memory parallel processor.

      integer n, number_of_iterations, i, j, k
      parameter(n=16)
      real A(n,n), Temp(n,n)
      ... (Initialize A, number_of_iterations) ...
      do k=1, number_of_iterations
c       Update non-edge elements only
        do i=2, n-1
          do j=2, n-1
            Temp(i, j)=(A(i, j-1)+A(i, j+1)+A(i+1, j)+A(i-1, j))*0.25
          enddo
        enddo
        do i=2, n-1
          do j=2, n-1
            A(i, j)=Temp(i, j)
          enddo
        enddo
      enddo

Figure 1

A Computation Expressed in Fortran 77

      integer n, number_of_iterations, i, j, k
      parameter(n=16)
      real A(n, n)
!hpf$ distribute A(block, block)
      ... (Initialize A, number_of_iterations) ...
      do k=1, number_of_iterations
        forall (i=2:n-1, j=2:n-1)    ! Update non-edge elements only
          A(i, j)=(A(i, j-1)+A(i, j+1)+A(i+1, j)+A(i-1, j))*0.25
        endforall
      enddo

Figure 2

The Same Computation Expressed in HPF
Figure 3

An Array Distributed over Four Processors 
Processor 0           Processor 1           Processor 2           Processor 3

SEND                  SEND                  SEND                  SEND
A(8, 2)...A(8, 8)     A(9, 2)...A(9, 8)     A(2, 9)...A(8, 9)     A(9, 9)...A(15, 9)
to Processor 1        to Processor 0        to Processor 0        to Processor 1

SEND                  SEND                  SEND                  SEND
A(2, 8)...A(8, 8)     A(9, 8)...A(15, 8)    A(8, 9)...A(8, 15)    A(9, 9)...A(9, 15)
to Processor 2        to Processor 3        to Processor 3        to Processor 2

RECEIVE               RECEIVE               RECEIVE               RECEIVE
A(9, 2)...A(9, 8)     A(8, 2)...A(8, 8)     A(2, 8)...A(8, 8)     A(9, 8)...A(15, 8)
from Processor 1      from Processor 0      from Processor 0      from Processor 1

RECEIVE               RECEIVE               RECEIVE               RECEIVE
A(2, 9)...A(8, 9)     A(9, 9)...A(15, 9)    A(9, 9)...A(9, 15)    A(8, 9)...A(8, 15)
from Processor 2      from Processor 3      from Processor 3      from Processor 2



Figure 4 

Compiler-generated Communication for a FORALL Statement 
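
The messages in Figure 4 follow mechanically from the data layout and the four-neighbor update. The Python sketch below is an illustration under stated assumptions, not the compiler's algorithm: it assumes the (BLOCK, BLOCK) layout and processor numbering described for Figure 3, and the helper names `owner` and `receives` are hypothetical. For every updated element it looks at the four neighbors and records any neighbor that lives on a different processor.

```python
# Illustrative sketch only: derives the messages of Figure 4 from the data
# layout. The helper names are hypothetical; this is not the compiler's algorithm.
from collections import defaultdict

N = 16          # A is N x N, as in Figures 1 and 2
HALF = N // 2   # (BLOCK, BLOCK) over a 2 x 2 processor arrangement

def owner(i, j):
    """Processor owning A(i, j) (1-based indices), numbered as in Figure 3."""
    return (i - 1) // HALF + 2 * ((j - 1) // HALF)

def receives():
    """Map (receiver, sender) -> sorted list of elements the receiver needs."""
    needed = defaultdict(set)
    for i in range(2, N):            # non-edge elements only: 2..N-1
        for j in range(2, N):
            me = owner(i, j)
            for ni, nj in ((i, j - 1), (i, j + 1), (i + 1, j), (i - 1, j)):
                src = owner(ni, nj)
                if src != me:        # neighbor lives on another processor
                    needed[(me, src)].add((ni, nj))
    return {pair: sorted(elems) for pair, elems in needed.items()}

schedule = receives()
for (dst, src), elems in sorted(schedule.items()):
    print(f"Processor {dst} receives {len(elems)} values from processor {src}: "
          f"A{elems[0]} ... A{elems[-1]}")
```

Each processor ends up receiving seven values from each of two neighbors, and every RECEIVE has a matching SEND on the owning processor, which is the pattern tabulated in Figure 4.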



Although the communication needed in this sim- 
ple example is not difficult to figure out by hand, 
keeping track of the communication needed for 
higher-dimensional arrays, distributed onto more 
processors, with more complicated computations, can 
be a very difficult, bug-prone task. In addition, a num- 
ber of the optimizations that can be performed would 
be extremely tedious to figure out by hand. Never- 
theless, distributed-memory parallel processors are 
programmed almost exclusively today by writing pro- 
grams that contain explicit hand-generated calls to the 
SEND and RECEIVE communication routines. The
difference between this kind of programming and pro- 
gramming in HPF is comparable to the difference 
between assembly language programming and high- 
level language programming. 

This paper continues with an overview of the HPF 
language, a discussion of the machine architecture tar- 
geted by the compiler, the architecture of the compiler 
itself, and a discussion of some optimizations per- 
formed by its components. It concludes with recent 
performance results, showing that HPF programs
compiled with Digital's compiler scale linearly in sig- 
nificant cases. 

Overview of the High Performance 
Fortran Language 

High Performance Fortran consists of a small set of 
extensions to Fortran 90. It is a data-parallel program- 
ming language, meaning that parallelism is made pos- 
sible by the explicit distribution of large arrays of data 
across processors, as opposed to a control-parallel
language, in which threads of computation are distributed.
Like the standard Fortran 77, Fortran 90, and C
models, the HPF programming model contains a sin- 
gle thread of control; the language itself has no notion 
of process or thread. 

Conceptually, the program executes on all the 
processors simultaneously. Since each processor con- 
tains only a subset of the distributed data, occasionally 
a processor may need to access data stored in the 
memory of another processor. The compiler deter- 
mines the actual details of the interprocessor commu- 
nication needed to support this access; that is, rather 
than being specified explicitly, the details are implicit 
in the program. 

The compiler translates HPF programs into low- 
level code that contains explicit calls to SEND and 
RECEIVE message-passing routines. All addresses in 
this translated code are modified so that they refer to
data local to a processor. As part of this translation, 
addressing expressions and loop bounds become 
expressions involving the processor number on which 
the code is executing. Thus, the compiler needs to gen- 
erate only one program: the generated code is parame- 
trized by the processor number and so can be executed 
on all processors with appropriate results on each 
processor. This generated code is called explicit single- 
program multiple-data code, or explicit-SPMD code. 
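
The idea of a single program parametrized by processor number can be sketched as follows. This Python fragment is only a simulation under assumed names (`local_bounds`, `spmd_kernel`) and an assumed one-dimensional BLOCK layout; it does not reproduce the code that Digital's compiler generates. Every simulated processor runs the same routine, and the processor number alone determines the loop bounds and the translation from global indices to local addresses.

```python
# Illustrative sketch only: one routine, parametrized by processor number,
# run once per simulated processor. Names and layout are assumptions.

N = 16                      # global array size
P = 4                       # number of processors
BLOCK = (N + P - 1) // P    # elements per processor under a BLOCK distribution

def local_bounds(me):
    """Global index range [lo, hi) owned by processor `me` (0-based)."""
    lo = me * BLOCK
    hi = min(lo + BLOCK, N)
    return lo, hi

def spmd_kernel(me, local):
    """Same code on every processor; only `me` differs.

    Computes x(g) = 2*g for each global index g this processor owns and
    stores the result at local address g - lo.
    """
    lo, hi = local_bounds(me)
    for g in range(lo, hi):
        local[g - lo] = 2 * g

# Simulate the machine: each processor has its own private memory.
memories = [[0] * BLOCK for _ in range(P)]
for me in range(P):
    spmd_kernel(me, memories[me])

# Gather the pieces only to compare against the serial computation.
gathered = []
for me in range(P):
    lo, hi = local_bounds(me)
    gathered.extend(memories[me][: hi - lo])
print(gathered == [2 * g for g in range(N)])
```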

In some cases, the programmer may find it useful 
to write explicit-SPMD code at the source code level.
To accommodate this, the HPF language includes an 
escape hatch called EXTRINSIC procedures that is 
used to leave data-parallel mode and enter explicit- 
SPMD mode. 
We now describe some of the HPF language extensions
used to manage parallel data.

Distributing Data over Processors 

Data is distributed over processors by the 
DISTRIBUTE directive, the ALIGN directive, or 
the default distribution. 

The DISTRIBUTE Directive For parallel execution of 
array operations, each array must be divided in mem- 
ory, with each processor storing some portion of 
the array in its own local memory. Dividing the array 
into parts is known as distributing the array. The HPF 
DISTRIBUTE directive controls the distribution of 
arrays across each processor's local memory. It does 
this by specifying a mapping pattern of data objects
onto processors. Many mappings are possible; we illus- 
trate only a few. 

Consider first the case of a 16 × 16 array A in an
environment with four processors. One possible speci- 
fication for A is 

      real A(16, 16)
!hpf$ distribute A(*, block)

The asterisk (*) for the first dimension of A means 
that the array elements are not distributed along 
the first (vertical) axis. In other words, the elements 
in any given column are not divided among differ- 
ent processors, but are assigned as a single block to 
one processor. This type of mapping is referred to as 
serial distribution. Figure 5 illustrates this distribution. 

The BLOCK keyword for the second dimension 
means that for any given row, the array elements are 
distributed over each processor in large blocks. The 
blocks are of approximately equal size (in this case,
they are exactly equal), with each processor holding
one block. As a result, A is broken into four contigu- 
ous groups of columns, with each group assigned to 
a separate processor. 

Another possibility is a (*, CYCLIC) distribution. 
As in (*, BLOCK), all the elements in each column are 
assigned to one processor. The elements in any given 
row, however, are dealt out to the processors in round- 
robin order, like playing cards dealt out to players
around a table. When elements are distributed over n 
processors, each processor contains every nth column,
starting from a different offset. Figure 6 shows the 
same array and processor arrangement, distributed 
CYCLIC instead of BLOCK. 

As these examples indicate, the distributions of the 
separate dimensions are independent. 
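
The two mappings can be written down as simple ownership functions for a single dimension. The following Python sketch is illustrative only; the function names are hypothetical, and the block size is taken to be the ceiling of n/p, which gives the equal blocks described above for 16 columns on four processors.

```python
# Illustrative sketch only: ownership of one array dimension under BLOCK and
# CYCLIC distributions. Function names are hypothetical.

def block_owner(j, n, p):
    """Processor (0-based) owning index j (1-based) of n elements over p processors."""
    block_size = (n + p - 1) // p      # ceiling(n / p)
    return (j - 1) // block_size

def cyclic_owner(j, p):
    """Indices are dealt out round-robin, like cards around a table."""
    return (j - 1) % p

n, p = 16, 4
block_map = "".join(str(block_owner(j, n, p)) for j in range(1, n + 1))
cyclic_map = "".join(str(cyclic_owner(j, p)) for j in range(1, n + 1))
print(block_map)    # 0000111122223333
print(cyclic_map)   # 0123012301230123
```

Because each dimension is mapped by its own function, a two-dimensional distribution such as (*, BLOCK) or (BLOCK, BLOCK) is obtained by applying one such function per dimension.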

A (BLOCK, BLOCK) distribution, as in Figure 3, 
divides the array into large rectangles. In that figure, 
the array elements in any given column or any given 
row are divided into two large blocks: Processor 0 gets 
A(1:8, 1:8), processor 1 gets A(9:16, 1:8), processor 2
gets A(1:8, 9:16), and processor 3 gets A(9:16, 9:16).



Figure 5 

A(*, BLOCK) Distribution 



0123012301230123 



Figure 6 

A(*, CYCLIC) Distribution 



The ALIGN Directive The ALIGN directive is used to 
specify the mapping of arrays relative to one another.
Corresponding elements in aligned arrays are always 
mapped to the same processor; array operations 
between aligned arrays are in most cases more efficient 
than array operations between arrays that are not
known to be aligned. 

The most common use of ALIGN is to specify that 
the corresponding elements of two or more arrays be 
mapped identically, as in the following example:

!hpf$ align A(i) with B(i)

This example specifies that the two arrays A and B are
always mapped the same way. More complex alignments
can also be specified. For example:

!hpf$ align E(i) with F(2*i-1)

In this example, the elements of E are aligned with the
odd elements of F. In this case, E can have at most half
as many elements as F.

An array can be aligned with the interior of a larger 
array: 

real A(12, 12) 
real B(16, 16)
!hpf$ align A(i, j) with B(i+2, j+2)

In this example, the 12 × 12 array A is aligned with
the interior of the 16 × 16 array B (see Figure 7). Each
interior element of B is always stored on the same
processor as the corresponding element of A.
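
One way to picture ALIGN is that the aligned array does not
get a mapping of its own; it borrows the owner of the element
it is aligned with. The following Python sketch illustrates
this for the two alignments above. The processor count, the
extents of E and F, and the choice of BLOCK and (*, BLOCK)
distributions for the align targets are assumptions made for
the illustration.

```python
def block_owner(index, extent, nprocs):
    """Owner of the 1-based `index` under a BLOCK distribution."""
    block_size = -(-extent // nprocs)  # ceiling division
    return (index - 1) // block_size


NPROCS = 4      # hypothetical processor count
F_EXTENT = 16   # hypothetical extent of F
E_EXTENT = 8    # E can have at most half as many elements as F


def owner_of_F(k):
    # Assumption: F is distributed BLOCK.
    return block_owner(k, F_EXTENT, NPROCS)


def owner_of_E(i):
    # align E(i) with F(2*i-1): E(i) lives wherever F(2*i-1) lives.
    return owner_of_F(2 * i - 1)


def owner_of_B(i, j):
    # Assumption: B(16, 16) is distributed (*, BLOCK).
    return block_owner(j, 16, NPROCS)


def owner_of_A(i, j):
    # align A(i, j) with B(i+2, j+2): A sits in the interior of B.
    return owner_of_B(i + 2, j + 2)


e_owners = [owner_of_E(i) for i in range(1, E_EXTENT + 1)]
print("owners of E(1:8):", e_owners)
print("owner of A(1, 3) and B(3, 5):", owner_of_A(1, 3), owner_of_B(3, 5))
```

With these assumptions, an operation such as A(i, j) =
B(i+2, j+2) never needs interprocessor communication, because
both operands always have the same owner.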

The Default Distribution Variables that are not explic- 
itly distributed or aligned are given a default distribu- 
tion by the compiler. The default distribution is not 
specified by the language: different compilers can 
choose different default distributions, usually based 
on constraints of the target architecture. In the DEC 
Fortran 90 language, an array or scalar with the default 
distribution is completely replicated. This decision was 
made because the large arrays in the program are the 
significant ones that the programmer has to distribute 
explicitly to get good performance. Any other arrays 
or scalars will be small and generally will benefit from 
being replicated since their values will then be available 
everywhere. Of course, the programmer retains com- 
plete control and can specify a different distribution 
for these arrays. 

Replicated data is cheap to read but generally 
expensive to write. Programmers typically use repli- 
cated data for information that is computed infre- 
quently but used often. 



Figure 7

An Example of Array Alignment

Data Mapping and Procedure Calls

The distribution of arrays across processors introduces 
a new complication for procedure calls: the interface 
between the procedure and the calling program must 
take into account not only the type and size of the rel- 
evant objects but also their mapping across processors. 
The HPF language includes special forms of the 
ALIGN and DISTRIBUTE directives for procedure 
interfaces. These allow the program to specify whether
array arguments can be handled by the procedure as 
they are currently distributed, or whether (and how) 
they need to be redistributed across the processors. 

Expressing Parallel Computations 

Parallel computations in HPF can be identified in four 
ways: 

■ Fortran 90 array assignments 

■ FORALL statements 

■ The INDEPENDENT directive, applied to DO 
loops and FORALL statements 

■ Fortran 90 and HPF intrinsics and library functions 

In addition, a compiler may be able to discover paral- 
lelism in other constructs. In this section, we discuss 
the first two of these parallel constructions. 

Fortran 90 Array Assignment In Fortran 77, operations 
on whole arrays can be accomplished only through 
explicit DO loops that access array elements one at a 
time. Fortran 90 array assignment statements allow 
operations on entire arrays to be expressed more simply. 

In Fortran 90, the usual intrinsic operations for 
scalars (arithmetic, comparison, and logical) can be 
applied to arrays, provided the arrays are of the same
shape. For example, if A, B, and C are two-dimensional
arrays of the same shape, the statement C = A + B 
assigns to each element of C a value equal to the sum 
of the corresponding elements of A and B. 

In more complex cases, this assignment syntax can 
have the effect of drastically simplifying the code. For 
instance, consider the case of three-dimensional 
arrays, such as the arrays dimensioned in the following 
declaration: 

real D(10, 5:24, -5:M), E(0:9, 20, M+6) 

In Fortran 77 syntax, an assignment to every ele- 
ment of D requires triple-nested loops such as the 
example shown in Figure 8. 

In Fortran 90, this code can be expressed in a single 
line: 

D = 2.5*D + E + 2.0

The FORALL Statement The FORALL statement is an 
HPF extension to the American National Standards 
Institute (ANSI) Fortran 90 standard but has been
included in the draft Fortran 95 standard. 



do i = 1, 10
  do j = 5, 24
    do k = -5, M
      D(i, j, k) = 2.5*D(i, j, k) + E(i-1, j-4, k+6) + 2.0
    end do
  end do
end do



Figure 8 

An Example of a Triple-nested Loop

FORALL is a generalized form of Fortran 90 array 
assignment syntax that allows a wider variety of array 
assignments to be expressed. For example, the diago- 
nal of an array cannot be represented as a single 
Fortran 90 array section. Therefore, the assignment of
a value to every element of the diagonal cannot be
expressed in a single array assignment statement. It
can be expressed in a FORALL statement:

real, dimension(n, n) :: A
forall (i = 1:n) A(i, i) = 1

Although FORALL structures serve the same pur- 
pose as some DO loops do in Fortran 77, a FORALL 
structure is a parallel assignment statement, not a 
loop, and in many cases produces a different result 
from an analogous DO loop. For example, the 
FORALL statement 

forall (i = 2:5) C(i, i) = C(i-1, i-1)

applied to the matrix

[matrix not recoverable from the source]

produces the following result:

[matrix not recoverable from the source]



On the other hand, the apparently similar DO loop 

do i = 2, 5
  C(i, i) = C(i-1, i-1)
end do

produces

[matrix not recoverable from the source]

This happens because the DO loop iterations are performed
sequentially, so that each successive element of
the diagonal is updated before it is used in the next
iteration. In contrast, in the FORALL statement, all
the diagonal elements are fetched and used before any
stores happen.
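
The difference between the two constructs can be reproduced
in a short Python sketch. The 5 × 5 starting matrix used here
(a diagonal of 11, 22, 33, 44, 55) is a hypothetical input
chosen for the illustration, and the function names are not
part of any compiler.

```python
def diagonal_matrix(values):
    """Build a square matrix with `values` on the diagonal and 0 elsewhere."""
    n = len(values)
    return [[values[r] if r == c else 0 for c in range(n)] for r in range(n)]


def forall_diagonal_shift(c):
    """FORALL semantics: fetch every right-hand side before any store."""
    n = len(c)
    fetched = [c[i - 1][i - 1] for i in range(1, n)]
    for i in range(1, n):
        c[i][i] = fetched[i - 1]
    return c


def do_loop_diagonal_shift(c):
    """DO loop semantics: each iteration sees the stores of earlier ones."""
    n = len(c)
    for i in range(1, n):
        c[i][i] = c[i - 1][i - 1]
    return c


start = [11, 22, 33, 44, 55]  # hypothetical starting diagonal
forall_result = forall_diagonal_shift(diagonal_matrix(start))
do_result = do_loop_diagonal_shift(diagonal_matrix(start))

forall_diag = [forall_result[i][i] for i in range(5)]
do_diag = [do_result[i][i] for i in range(5)]
print("FORALL diagonal: ", forall_diag)
print("DO loop diagonal:", do_diag)
```

The FORALL version shifts the diagonal down by one position,
while the DO loop propagates the first diagonal element all
the way down.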

The Target Machine 

Digital's DEC Fortran 90 compiler generates code 
for clusters of Alpha processors running the Digital 
UNIX operating system. These clusters can be separate 
Alpha workstations or servers connected by a fiber dis- 
tributed data interface (FDDI) or other network 
devices. (Digital's high-speed GIGAswitch/FDDI system
is particularly appropriate.14) A shared-memory,
symmetric multiprocessing (SMP) system like the
AlphaServer 8400 system can also be used. In the case
of an SMP system, the message-passing library uses
shared memory as the message-passing medium; the
generated code is otherwise identical. The same executable
can run on a distributed-memory cluster or an
SMP shared-memory cluster without recompiling.
DEC Fortran 90 programs use the execution envi- 
ronment provided by Digital's Parallel Software 
Environment (PSE), a companion product.15 PSE
is responsible for invoking the program on multiple 
processors and for performing the message passing 
requested by the generated code. 

The Architecture of the Compiler 

Figure 9 illustrates the high-level architecture of 
the compiler. The curved path is the path taken 
when compiler command-line switches are set for 
compiling programs that will not execute in parallel, 
or when the scoping unit being compiled is declared 
as EXTRINSIC(HPF_LOCAL).

Figure 9 shows the front end, transform, middle 
end, and GEM back end components of the compiler. 
These components function in the following ways: 

■ The front end parses the input code and produces 
an internal representation containing an abstract 
syntax tree and a symbol table. It performs extensive
semantic checking.16

Figure 9

Compiler Components



■ The transform component performs the transfor- 
mation from global-HPF to explicit-SPMD form. 
To do this, it localizes the addressing of data, inserts 
communication where necessary, and distributes 
parallel computations over processors. 

■ The middle end translates the internal representa- 
tion into another form of internal representation 
suitable for GEM. 

■ The GEM back end, also used by other Digital 
compilers, performs local and global optimization, 
storage allocation, code generation, register alloca- 
tion, and emits binary object code. 17 

In this paper, we are mainly concerned with the 
transform component of the compiler. 

An Overview of Transform 

Figure 10 shows the transform phases discussed in this
paper. These phases perform the following key tasks: 

■ LOWER. Transforms array assignments so that 
they look internally like FORALL statements. 

■ DATA. Fills in the data space information for each 
symbol using information from HPF directives 
where available. This determines where each data 
object lives, i.e., how it is distributed over the 
processors. 

■ ITER. Fills in the iteration space information for 
each computational expression node. This deter- 
mines where each computation takes place and 
indicates where communication is necessary. 

■ ARG. Pulls functions in the interior of expressions
up to the statement level. It also compares the map- 
ping of actual arguments to that of their corre- 
sponding dummies and generates remapping into 
compiler-generated temporaries if necessary.

■ DIVIDE. Pulls all communication inside expressions
(identified by ITER) up to the statement level
and identifies what kind of communication is 
needed. It also ensures that information needed for 
flow of control is available at each processor. 

■ STRIP. Turns global-HPF code into explicit-SPMD 
code by localizing the addressing of all data objects
and inserting explicit SEND and RECEIVE calls 
to make communication explicit. In the process, 
it performs strip mining and loop optimizations, 
vectorizes communication, and optimizes nearest- 
neighbor computations. 

Transform uses the following main data structures: 

■ Symbol table. This is the symbol table created by 
the front end. It is extended by the transform phase 
to include dope information for array and scalar 
symbols. 

■ Dotree. Transform uses the dotree form of the 
abstract syntax tree as an internal representation of 
the program. 

■ Dependence graph. This is a graph whose nodes are 
expression nodes in the dotree and whose edges 
represent dependence edges. 

■ Data spaces. A data space is associated with each 
data symbol (i.e., each array and each scalar). The
data space information describes how each data 
object is distributed over the processors. This infor- 
mation is derived from HPF directives. 

■ Iteration spaces. An iteration space is associated 
with each computational node in the dotree. The 
iteration space information describes how compu- 
tations are distributed over the processors. This 
information is not specified in the source code but 
is produced by the compiler.

The interrelationship of these data structures is dis- 
cussed in Reference 18. The data and iteration spaces 
are central to the processing performed by transform.

The Transform Phases 

LOWER 

Since the FORALL statement is a generalization of a
Fortran 90 array assignment and includes it as a special
case, it is convenient for the compiler to have a uniform
representation for these two constructions. The
LOWER phase implements this by turning each
Fortran 90 array assignment into an equivalent
FORALL statement (actually, into the dotree representation
of one). This uniform representation means
that the compiler has far fewer special cases to consider
than otherwise might be necessary and leads to no
degradation of the generated code.

Figure 10

The Transform Phases

DATA 

The DATA phase specifies where data lives. Placing 
and addressing data correctly is one of the major tasks 
of transform. There are a large number of possibilities:

When a value is available on every processor, it is
said to be replicated. When it is available on more than
one but not all processors, it is said to be partially
replicated. For instance, a scalar may live on only one
processor, or on more than one processor. Typically, a
scalar is replicated; that is, it lives on all processors. The
replication of scalar data makes fetches cheap because each
processor has a copy of the requested value. Stores to 
replicated scalar data can be expensive, however, if the 
value to be stored has not been replicated. In that case, 
the value to be stored must be sent to each processor. 

The same consideration applies to arrays. Arrays 
may be replicated, in which case each processor has a 
copy of an entire array; or arrays may be partially replicated,
in which case each element of the array is available
on a subset of the processors.

Furthermore, arrays that are not replicated may be
distributed across the processors in several different
fashions, as explained above. In fact, each dimension
of each array may be distributed independently of
the other dimensions. The HPF mapping directives,
principally ALIGN and DISTRIBUTE, give the programmer
the ability to specify completely how each
dimension of each array is laid out. DATA uses the
information in these directives to construct an internal
description, or data space, of the layout of each array.

ITER 

The ITER phase determines where the intermediate
results of calculations should live. Its relationship to 
DATA can be expressed as: 

■ DATA decides where parallel data lives. 

■ ITER decides where parallel computations happen.

Each array has a fixed number of dimensions and an 
extent in each of those dimensions; these properties 
together determine the shape of an array. After DATA 
has finished processing, the shape and mapping of 
each array is known. Similarly, the result of a computation
has a particular shape and mapping. This shape
may be different from that of the data used in the computation.
As a simple example, the computation

A(:, :, 3) + B(:, :, 3)



has a two-dimensional shape, even though both arrays 
A and B have three-dimensional shapes. The data 
space data structure is used to describe the shape of 
each array and its layout in memory and across proces- 
sors; similarly, iteration space is used to describe the 
shape of each computation and its layout across 
processors. One of the main tasks of transform is to 
construct the iteration space for each computation so 
that it leads to as little interprocessor communication
as possible; this construction happens in ITER. The
compiler's view of this construction and the interaction
of these spaces are explained in Reference 18.

Shapes can change within an expression: while some
operators return a result having the shape of their
operands (e.g., adding two arrays of the same shape
returns an array of the same shape), other operators
can return a result having a different shape than the
shape of their operands. For example, reductions like
SUM return a result having a shape with lower rank
than that of the input expression being reduced.

One well-known method of determining where 
computations happen is the "owner-computes" rule. 
With this method, all the values needed to construct 
the computation on the right-hand side of an assignment
statement are fetched (using interprocessor
communication if necessary) and computed on the
processor that contains the left-hand-side location.
Then they are stored to that left-hand-side location (on
the same processor on which they were computed).
Thus a description of where computations occur is
derived from the output of DATA. There are, however,
simple examples where this method leads to less than 
optimal performance. For instance, in the code 

real A(n, n), B(n, n), C(n, n) 
!hpf$ distribute A(block, block)
!hpf$ distribute B(cyclic, cyclic)
!hpf$ distribute C(cyclic, cyclic)

forall (i = 1:n, j = 1:n)
  A(i, j) = B(i, j) + C(i, j)
end forall

the owner-computes rule would move B and C to 
align with A, and then add the moved values of B and 
C and assign to A. It is certainly more efficient, however,
to add B and C together where they are aligned
with each other and then communicate the result to 
where it needs to be stored to A. With this procedure, 
we need to communicate only one set of values rather 
than two. The compiler identifies cases such as these 
and generates the computation, as indicated here, to 
minimize the communication. 
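
The saving can be estimated with a small Python sketch that
counts how many array elements have to cross processor
boundaries under each strategy. It is a rough model only: it
assumes an n × n array on a rows × cols processor grid, uses
the same grid for the BLOCK and CYCLIC mappings, and counts
elements rather than messages. All function names are
hypothetical.

```python
def block_owner(index, extent, nprocs):
    """Owner of the 1-based `index` under a BLOCK distribution."""
    block_size = -(-extent // nprocs)  # ceiling division
    return (index - 1) // block_size


def cyclic_owner(index, nprocs):
    """Owner of the 1-based `index` under a CYCLIC distribution."""
    return (index - 1) % nprocs


def elements_moved(n, rows, cols):
    """Return (owner_computes, add_then_send) element counts.

    A is (BLOCK, BLOCK); B and C are (CYCLIC, CYCLIC).
    """
    mismatched = 0
    for i in range(1, n + 1):
        for j in range(1, n + 1):
            a_owner = (block_owner(i, n, rows), block_owner(j, n, cols))
            bc_owner = (cyclic_owner(i, rows), cyclic_owner(j, cols))
            if a_owner != bc_owner:
                mismatched += 1
    owner_computes = 2 * mismatched  # both B(i, j) and C(i, j) travel to A's owner
    add_then_send = mismatched       # only the sum B(i, j) + C(i, j) travels
    return owner_computes, add_then_send


print(elements_moved(8, 2, 2))
```

Whatever the array size, the second strategy moves half as
many values as the owner-computes rule in this model, because
B and C are already aligned with each other.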

ARG 

The ARG phase performs any necessary remapping of
actual arguments at subroutine call sites. It does this
by comparing the mapping of the actuals (as determined
by ITER) to the mapping of the corresponding
dummies (as determined by DATA).

In our implementation, the caller performs all
remapping. If remapping is necessary, ARG exposes
that remapping by inserting an assignment statement
that remaps the actual to a temporary that is mapped
the way the dummy is mapped. This guarantees that
references to a dummy will access the correct data as
specified by the programmer. Of course, if the parameter
is an OUT argument, a similar copy-out remapping
has to be inserted after the subroutine call.

DIVIDE 

The DIVIDE phase partitions ("divides") each expres- 
sion in the dotree into regions. Each region contains 
computations that can happen without interprocessor
communication. When region R uses the values of 
a subexpression computed in region S, for example, 
interprocessor communication is required to remap 
the computed values from their locations in S to their 
desired locations in R. DIVIDE makes a temporary 
mapped the way region R needs it and makes an 
explicit assignment statement whose left-hand side 
is that temporary and whose right-hand side is the 
subexpression computed in region S. In this way, 
DIVIDE makes explicit the interprocessor communi- 
cation that is implicit in the iteration space information 
attached to each expression node. 

DIVIDE also performs other processing: 

■ DIVIDE replicates expressions needed to manage 
control flow, such as an expression representing 
a bound of a DO loop or the condition in an IF 
statement. Consequently, each processor can do 
the necessary branching. 

■ For each statement requiring communication, 
DIVIDE identifies the kind of communication 
needed. 

Depending on what knowledge the two sides of the 
communication (i.e., the sender and the receiver) 
have, we distinguish two kinds of communication: 

- Full knowledge. The sender knows what it is 
sending and to whom, and the receiver knows 
what it is receiving and from whom. 

- Partial knowledge. Either the sender knows 
what it is sending and to whom, or the receiver 
knows what it is receiving and from whom, but
the other party knows nothing.

This kind of message is typical of code dealing 
with irregular data accesses, for instance, code 
with array references containing vector-valued 
subscripts. 

STRIP 

The STRIP phase (shortened from "strip miner"; 
probably a better term would be the "localizer") takes
the statements categorized by DIVIDE as needing 



communication and inserts calls to library routines to 
move the data from where it is to where it needs to be. 

It then localizes parallel assignments coming from 
vector assignments and FORALL constructs. In other 
words, each processor has some (possibly zero) num- 
ber of array locations that must be stored to. A set of 
loops is generated that calculates the value to be stored 
and stores it. The bounds for these loops are depen- 
dent on the distribution of the array being assigned to 
and the section of the array being assigned to. These 
bounds may be explicit numbers known at compile
time, or they may be expressions (when the array size
is not known at compile time). In any case, they are
exposed so that they may be optimized by later phases.
They are not calls to run-time routines.

The subscripts of each dimension of each array in 
the statement are then rewritten in terms of the loop 
variable. This modification effectively turns the origi- 
nal global subscript into a local subscript. Scalar sub- 
scripts are also converted to local subscripts, but in this 
case the subscript expression does not involve loop
indices. Similarly, scalar assignments that reference 
array elements have their subscripts converted from 
global addressing to local addressing, based on the 
original subscript and the distribution of the corre- 
sponding dimension of the array. They do not require 
strip loops. For example, consider the code fragment
shown in Figure 11a.

Here k is some variable whose value has been 
assigned before the FORALL. Let us assume that A 
and B have been distributed over a 4 × 5 processor
array in such a way that the first dimensions of A and B
are distributed CYCLIC over the first dimension of the
processor array (which has extent 4), and the second
dimensions of A and B are distributed BLOCK over
the second dimension of the processor array (which
has extent 5). (The programmer can express this
through a facility in HPF.) The generated code is
shown in Figure 11b.
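
The bounds of the strip loop over i in Figure 11b can be
checked with a small Python sketch. It covers only the first
(CYCLIC) dimension, and it assumes that processor coordinate
c owns the global indices c+1, c+1+P, c+1+2P, and so on, with
local indices counted from 1. The helper name is hypothetical
and is not taken from the compiler.

```python
def cyclic_local_bounds(lo, hi, nprocs, coord):
    """Local index range that covers global indices lo..hi.

    The dimension is distributed CYCLIC over `nprocs` processors and
    `coord` is this processor's 0-based coordinate. Global index g is
    stored at local index (g - 1) // nprocs + 1. Returns (first, last);
    an empty range is reported as (1, 0).
    """
    first_owned = coord + 1
    # Smallest owned global index that is >= lo.
    first_global = lo + ((first_owned - lo) % nprocs)
    # Largest owned global index that is <= hi.
    last_global = hi - ((hi - first_owned) % nprocs)
    if first_global > last_global:
        return (1, 0)
    return ((first_global - 1) // nprocs + 1, (last_global - 1) // nprocs + 1)


# forall (i = 2:99) over a dimension of extent 100, CYCLIC over 4 processors
bounds = [cyclic_local_bounds(2, 99, 4, coord) for coord in range(4)]
for coord, (first, last) in enumerate(bounds):
    print("m mod 4 =", coord, "-> do i =", first, ",", last)
```

Under these assumptions the sketch yields the same lower and
upper bounds as the two conditional expressions in the DO
statement of Figure 11b: only the processors holding global
elements 1 and 100 have a shortened loop.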

real A(100, 20), B(100, 20)
!hpf$ distribute A(cyclic, block), B(cyclic, block)

forall (i = 2:99)
  A(i, k) = B(i, k)
end forall

(a) Code Fragment

m = my_processor()
if k mod 5 = ⌊m/4⌋ then
  do i = (if m mod 4 = 0 then 2 else 1), (if m mod 4 = 3 then 24 else 25)
    A(i, ⌊k/5⌋) = B(i, ⌊k/5⌋)
  end do
end if

(b) Pseudocode Generated for Code Fragment

Figure 11

Code Fragment and Pseudocode Generated for Code Fragment

If the array assigned to on the left-hand side of such
a statement is also referenced on the right-hand side,
then replacing the parallel FORALL by a DO loop
may violate the "fetch before store" semantics of the
original statement. That is, an array element may be
assigned to on one iteration of the DO loop, and this
new value may subsequently be read on a later iteration.
In the original meaning of the statement, however,
all values read would be the original values.

This problem can always be resolved by evaluating
the right-hand side of the statement in its entirety into
a temporary array, and then, in a second set of DO
loops, assigning that temporary to the left-hand side.
We use dependence analysis to determine if such a
problem occurs at all. Even if it does, there are cases in
which loop transformations can be used to eliminate
the need for a temporary, as outlined in Reference 19.
(Some poor implementations always introduce the
temporary even when it is not needed.)
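
As a concrete case, consider a parallel assignment of the
form A(i) = A(i-1) over i = 2:n. The Python sketch below
compares a naive forward loop, the always-correct version
with a temporary, and a reversed loop that needs no
temporary. Loop reversal is used here only as one example of
such a transformation; the sketch does not claim to show the
analysis that the compiler actually performs.

```python
def shift_with_forward_loop(a):
    """Naive DO loop: reads values that earlier iterations already overwrote."""
    a = list(a)
    for i in range(1, len(a)):
        a[i] = a[i - 1]
    return a


def shift_with_temporary(a):
    """Always correct: evaluate the whole right-hand side, then store."""
    a = list(a)
    temp = [a[i - 1] for i in range(1, len(a))]
    for i in range(1, len(a)):
        a[i] = temp[i - 1]
    return a


def shift_with_reversed_loop(a):
    """Loop reversal: every element is read before it is overwritten."""
    a = list(a)
    for i in range(len(a) - 1, 0, -1):
        a[i] = a[i - 1]
    return a


data = [1, 2, 3, 4, 5]
print("forward loop: ", shift_with_forward_loop(data))
print("temporary:    ", shift_with_temporary(data))
print("reversed loop:", shift_with_reversed_loop(data))
```

The reversed loop gives the same answer as the version with
the temporary, without the extra storage or the second pass
over the data.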

Unlike other HPF implementations, ours uses
compiler-generated inlined expressions instead of
function calls to determine local addressing values.
Furthermore, our implementation does not introduce
barrier synchronization, since the sends and receives
generated by the transform phase will enforce any
necessary synchronization. In general, this is much less
expensive than a naive insertion of barriers. The
reason this works can be seen as follows: first, any value
needed by a processor is computed either locally or
nonlocally. If the value is computed locally, the normal
control flow guarantees correct access order for that
value. If the value is computed nonlocally, the generated
receive on the processor that needs the value
causes the receiving processor to wait until the value
arrives from the sending processor. The sending
processor will not send the value until it has computed
it, again because of normal control flow. If the sending
processor is ready to send data before the receiving
processor is ready for it, the sending processor can
continue without waiting for the data to be received.
Digital's Parallel Software Environment (PSE) buffers
the data until it is needed.15
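
The argument can be illustrated with two Python threads that
stand in for two processors. This is an analogy only: the
unbounded queue plays the role of the buffering message
layer, and nothing here reflects the actual PSE interface.

```python
import queue
import threading


def run_two_processors():
    channel = queue.Queue()  # unbounded, so a send never waits for the receiver
    results = {}

    def sender():
        value = sum(range(1, 101))  # the value is computed first...
        channel.put(value)          # ...and only then sent
        results["sender_continued"] = True  # the sender carries on immediately

    def receiver():
        value = channel.get()       # blocks until the value has arrived
        results["received"] = value + 1

    # The receiver is started first on purpose: it simply waits in get().
    threads = [threading.Thread(target=receiver), threading.Thread(target=sender)]
    for thread in threads:
        thread.start()
    for thread in threads:
        thread.join()
    return results


print(run_two_processors())
```

No barrier appears anywhere: the blocking receive is the only
synchronization point, and it involves only the two
processors that actually exchange data.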

Some Optimizations Performed by the Compiler 

The GEM back end performs the following
optimizations: 



■ Constant folding 

■ Optimizations of arithmetic IF, logical IF, and 
block IF-THEN-ELSE

■ Global common subexpression elimination 

■ Removal of invariant expressions from loops 

■ Global allocation of general registers across pro- 
gram units 

■ In-line expansion of statement functions and 
routines 

■ Optimization of array addressing in loops 

■ Value propagation 

■ Deletion of redundant and unreachable code 

■ Loop unrolling 

■ Software pipelining to rearrange instructions 
between different unrolled loop iterations 

■ Array temporary elimination

In addition, the transform component performs 
some important optimizations, mainly devoted to 
improving interprocessor communication. We have 
implemented the following optimizations: 

Message Vectorization 

The compiler generates code to limit the communica- 
tion to one SEND and one RECEIVE for each array 
being moved between any two processors. This is the 
most obvious and basic of all the optimizations that a
compiler can perform for distributed-memory architectures
and has been widely studied.

If the arrays A and B are laid out as in Figure 12 and
if B is to be assigned to A, then array elements B(4),
B(5), and B(6), all of which live on processor 6,
should be sent to processor 1. Clearly, we do not want
to generate three distinct messages for this. Therefore,
we collect these three elements and generate one message
containing all three of them. This example
involves full knowledge.
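
A minimal Python sketch of the idea follows. It groups the
elements of an assignment A = B by (sender, receiver) pair so
that each pair exchanges a single message. The two layouts
used here are hypothetical and are not those of Figure 12: A
is stored in blocks of four elements and B in blocks of
three.

```python
from collections import defaultdict


def vectorize_messages(n, owner_of_source, owner_of_dest):
    """Group the elements of dest(1:n) = source(1:n) by processor pair.

    Returns a dict that maps (sender, receiver) to the list of element
    indices carried by the single message between that pair.
    """
    messages = defaultdict(list)
    for i in range(1, n + 1):
        sender, receiver = owner_of_source(i), owner_of_dest(i)
        if sender != receiver:  # local elements need no message
            messages[(sender, receiver)].append(i)
    return dict(messages)


def owner_of_A(i):
    return (i - 1) // 4  # hypothetical layout: blocks of 4 elements


def owner_of_B(i):
    return (i - 1) // 3  # hypothetical layout: blocks of 3 elements


messages = vectorize_messages(12, owner_of_B, owner_of_A)
for (sender, receiver), elements in sorted(messages.items()):
    print("processor", sender, "-> processor", receiver, ": B", elements)
```

In this example six elements have to move, but only three
messages are generated, one per communicating pair.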

Communications involving partial knowledge are
also vectorized, but they are much more expensive
because the side of the message without initial knowledge
has to be informed of the message. Although
there are several ways to do this, all are costly, either in
time or in space.

We use the same method, incidentally, to inline the
HPF XXX_SCATTER routines. These new routines
have been introduced to handle a parallel construct
that could cause more than one value to be assigned to
the same location. The outcome of such cases is determined
by the routine being inlined. For instance,
SUM_SCATTER simply adds all the values that arrive
at each location and assigns the final result to that location.
Although this is an example of interprocessor
communication with partial knowledge, we can still
build up messages so that only a minimum number of
messages are sent.
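
The combining rule of SUM_SCATTER can be sketched
sequentially in Python. The sketch shows only what the result
is for a one-dimensional case with 1-based index values; it
says nothing about how the inlined parallel code builds its
messages, and the function below is a stand-in, not the HPF
library routine itself.

```python
def sum_scatter(values, base, index):
    """Add each values[k] into position index[k] (1-based) of a copy of base.

    Several values may target the same position; they are all summed.
    """
    result = list(base)
    for value, target in zip(values, index):
        result[target - 1] += value
    return result


# Two values arrive at location 1 and two at location 3; none at location 2.
print(sum_scatter([1, 2, 3, 4], [10, 20, 30], [1, 3, 1, 3]))
```

The sender of a value knows where it is going, but the owner
of a location does not know in advance which values will
arrive, which is why this is communication with partial
knowledge.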

In some cases, we can improve the handling of com- 
munications with partial knowledge, provided they 
occur more than once in a program. For more infor- 
mation, please see the section Run-time Preprocessing 
of Irregular Data Accesses. 

Strip Mining and Loop Optimizations 

Strip mining and loop optimizations have to do with 
generating efficient code on a per-processor basis, and
so in some sense can be thought of as conventional.
Generally, we follow the processing detailed in
Reference 19 and summarized as:

■ Strip mining obstacles are eliminated where possi- 
ble by loop transformations (loop reversal or loop 
interchange). 



■ Temporaries, if introduced, are of minimal size; this
is achieved by loop interchange. 

■ Exterior loop optimization is used to allow reused 
data to be kept in registers over consecutive itera- 
tions of the innermost loop. 

■ Loop fusion enables more efficient use of conven- 
tional optimizations and minimizes loop overhead. 

Nearest-neighbor Computations 

Nearest-neighbor computations are common in code 
written to discretize partial differential equations. See 
the example given in Figure 2.

If we have, for example, 16 processors, with the array 
A distributed in a (BLOCK, BLOCK) fashion over the
processors, then conceptually, the array is distributed as 
in Figure 13, where the arrows indicate communica- 
tion needed between neighboring processors. In fact, 
in this case, each processor needs to see values only 
from a narrow strip (or "shadow edge") in the memory 
of its neighboring processors, as shown in Figure 14. 

The compiler identifies nearest-neighbor computa- 
tions (the user does not have to tag them), and it alters 
the addressing of each array involved in these compu- 
tations (throughout the compilation unit). As a result, 
each processor can store those array elements that are 
needed from the neighboring processors. Those array 
elements are moved in (using message vectorization)
at the beginning of the computation, after which the
entire computation is local. 
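
The following Python sketch shows the shadow-edge idea on a
one-dimensional analogue: b(i) = a(i-1) + a(i+1) with the
array distributed BLOCK. Each simulated processor keeps its
block with one extra cell on each side, fills those cells
from its neighbors in one communication step, and then
computes without further communication. The one-dimensional
stencil, the even division of the array, and all names are
simplifying assumptions.

```python
def stencil_global(a):
    """Reference result: a[i-1] + a[i+1] at interior points, 0 at the two ends."""
    n = len(a)
    return [a[i - 1] + a[i + 1] if 0 < i < n - 1 else 0 for i in range(n)]


def stencil_with_shadow_edges(a, nprocs):
    n = len(a)
    size = n // nprocs  # assumes nprocs divides n evenly
    # Local storage: [left shadow cell] + owned block + [right shadow cell].
    local = [[None] + a[p * size:(p + 1) * size] + [None] for p in range(nprocs)]

    # Communication step: each shadow cell is filled by one neighbor message.
    messages = 0
    for p in range(nprocs):
        if p > 0:
            local[p][0] = local[p - 1][size]      # last owned element of the left neighbor
            messages += 1
        if p < nprocs - 1:
            local[p][size + 1] = local[p + 1][1]  # first owned element of the right neighbor
            messages += 1

    # Computation step: purely local, using only this processor's storage.
    result = []
    for p in range(nprocs):
        for k in range(1, size + 1):
            g = p * size + (k - 1)  # global 0-based position
            if 0 < g < n - 1:
                result.append(local[p][k - 1] + local[p][k + 1])
            else:
                result.append(0)
    return result, messages


a = list(range(16))
result, messages = stencil_with_shadow_edges(a, 4)
print(result == stencil_global(a), messages)
```

Because the received values sit directly next to the owned
block, the computation loop addresses them exactly as it
addresses local data.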

Recognizing nearest-neighbor statements helps 
generate better code in several ways:

■ Less run-time overhead. The compiler can easily 
identify the exact small portion of the array 
that needs to be moved. The communication for 
nearest-neighbor assignments is extremely regular: 
At each step, each processor is sending an entire 
shadow edge to precisely one of its neighbors. 
Therefore the communication processing overhead
is greatly reduced. That is, we are able to generate
communication involving even less overhead than
general communication involving full knowledge.

■ No local copying. If shadow edges were not used,
then the following standard processing would take
place: For each shifted-array reference on the right-hand
side of the assignment, shift the entire array;
then identify that part of the shifted array that lives
locally on each processor and create a local temporary
to hold it. Some of that temporary (the part
representing our shadow edge) would be moved in
from a neighboring processor, and the rest of the
temporary would be copied locally from the original
array. Our processing eliminates the need for
the local temporary and for the local copy, which is
substantial for large arrays.

■ Greater locality of reference. When the actual computation
is performed, greater locality of reference
is achieved because the shadow edges (i.e., the
received values) are now part of the array, rather
than being a temporary somewhere else in memory.

■ Fewer messages. Finally, the optimization also
makes it possible for the compiler to see that some
messages may be combined into one message,
thereby reducing the number of messages that
must be sent. For instance, if the right-hand side
of the assignment statement in the above example
also contained a term A(i+1, j+1), even though
overlapping shadow edges and an additional
shadow edge would now be in the diagonally adjacent
processor, no additional communication
would need to be generated.

Two Arrays in Memory

Figure 14

Shadow Edges for a Nearest-neighbor Computation

Reductions 

The SUM intrinsic function of Fortran 90 takes an 
array argument and returns the sum of all its elements.
Alternatively, SUM can return an array whose rank is
one less than the rank of its argument, and each of 
whose values is the sum of the elements in the argu- 
ment along a line parallel to a specified dimension. 



In either case, the rank of the result is less than that of
the argument; therefore, SUM is referred to as a 
reduction intrinsic. Fortran 90 includes a family of 
such reductions, and HPF adds more. 

We inline these reduction intrinsics in such a way
as to distribute the work as much as possible across 
the processors and to minimize the number of mes- 
sages sent. 

In general, the reduction is performed in three basic 
steps: 

1. Each processor locally performs the reduction oper- 
ation on its part of the reduction source into a buffer. 

2. These partial reduction results are combined with 
those of the other processors in a "logarithmic" 
fashion (to reduce the number of messages sent). 

3. The accumulated result is then locally copied to the 
target location. 

Figure 15 shows how the computations and com- 
munications occur in a complete reduction of an array 
distributed over four processors. In this figure, each 
vertical column represents the memory of a single 
processor. The processors are thought of (in this case) 
as being arranged in a 2 × 2 square; this is purely for 
conceptual purposes, since the actual processors are 
typically connected through a switch. 

First, the reduction is performed locally in the 
memory of each processor. This is represented by the 
vertical arrows in the figure. Then the computations 
are accumulated over the four processors in two steps: 
the two parallel curved arrows indicate the inter- 
processor communication in the first step, followed by 
the communication indicated by the remaining curved 
arrow in the second step. Of course, for five to eight 
processors, three communication steps would be 
needed, and so on. 
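
The three steps can be sketched in a few lines of Python. This is only an illustrative single-machine simulation, not the code the compiler generates: the name `tree_reduce`, the list-of-lists stand-in for per-processor memory, and the choice to accumulate toward processor 0 are all assumptions made for the example.

```python
from functools import reduce
import operator


def tree_reduce(local_parts, op=operator.add):
    """Simulate a complete reduction over len(local_parts) processors.

    local_parts[p] is the (non-empty) piece of the source array held by
    processor p. Returns (result, number_of_communication_steps).
    """
    # Step 1: every processor reduces its own part into a local buffer.
    buffers = [reduce(op, part) for part in local_parts]

    # Step 2: combine the buffers pairwise. Each pass of the while loop is
    # one communication step in which disjoint processor pairs exchange data.
    nprocs = len(buffers)
    steps = 0
    stride = 1
    while stride < nprocs:
        for dst in range(0, nprocs, 2 * stride):
            src = dst + stride
            if src < nprocs:
                # Models one message from processor src to processor dst.
                buffers[dst] = op(buffers[dst], buffers[src])
        stride *= 2
        steps += 1

    # Step 3: the accumulated value is copied to the target location
    # (in this sketch it is simply returned from processor 0's buffer).
    return buffers[0], steps
```

With four processors holding two elements each, `tree_reduce([[1, 2], [3, 4], [5, 6], [7, 8]])` returns `(36, 2)`, that is, two communication steps as in Figure 15. Any processor count from five to eight yields three steps.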

Although this basic idea never changes, the actual 
generated code must take into account various factors. 
These include (1) whether the object being reduced 
is replicated or distributed, (2) the different distri- 
butions that each array dimension might have, and 
(3) whether the reduction is complete or partial (i.e., 
with a DIM argument). 

Run-time Preprocessing of Irregular Data Accesses 

Run-time preprocessing of irregular data accesses is 
a popular technique. If an expression involving the 
same pattern of irregular data access is present more 
than once in a compilation unit, additional run-time 
preprocessing can be used to good effect. An abstract 
example would be code of the form: 

    call setup(U, V, W) 

    do t = 1, n_time_steps, 1 

      do i = 1, n, 1 
        A(V(i)) = A(V(i)) + B(W(i)) 
      enddo 

      do i = 1, n, 1 
        C(V(i)) = C(V(i)) + D(W(i)) 
      enddo 

      do i = 1, n, 1 
        E(V(i)) = E(V(i)) + F(W(i)) 
      enddo 

    enddo 

which could be written in HPF as: 

    call setup(U, V, W) 

    do i = 1, n_time_steps, 1 

      A = sum_scatter(B(W(1:n)), A, V(1:n)) 
      C = sum_scatter(D(W(1:n)), C, V(1:n)) 
      E = sum_scatter(F(W(1:n)), E, V(1:n)) 

    enddo 

To the compiler, the significant thing about this 
code is that the indirection vectors V and W are con- 
stant over iterations of the loop. Therefore, the com- 
piler computes the source and target addresses of the 
data that has to be sent and received by each processor 
once at the top of the loop, thus paying this price one 
time. Each such statement then becomes a communi- 
cation with full knowledge and is executed quite effi- 
ciently with message vectorization. 
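
The idea of paying the address computation once and then reusing it can be illustrated with a small inspector/executor sketch in Python. It is a single-process simulation under stated assumptions, not the compiler's run-time code: indices are 0-based, every array is assumed to be BLOCK-distributed with the same block size, and the names `build_schedule` and `apply_sum_scatter` are invented for this example.

```python
from collections import defaultdict


def build_schedule(V, W, block):
    """Inspector: run once, before the time-step loop.

    Groups every element move (read index w, accumulate into index v) by the
    pair (source processor, target processor), so that each pair can later be
    served by a single vectorized message.
    """
    schedule = defaultdict(list)
    for v, w in zip(V, W):
        schedule[(w // block, v // block)].append((w, v))
    return dict(schedule)


def apply_sum_scatter(schedule, source, base):
    """Executor: reuse the schedule for any arrays that share V and W."""
    result = list(base)
    for moves in schedule.values():
        # Each group models one message between a pair of processors.
        for w, v in moves:
            result[v] += source[w]
    return result


V = [0, 2, 0]
W = [1, 1, 2]
schedule = build_schedule(V, W, block=2)  # computed once, reused below

A = apply_sum_scatter(schedule, [10, 20, 30], [0, 0, 0])
C = apply_sum_scatter(schedule, [1, 2, 3], [5, 5, 5])
```

Because the schedule depends only on V and W, the same object serves the A, C, and E updates in every iteration; only the data values change.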

Other Communication Optimizations 

The processing needed to set up communication of 
array assignments is fairly expensive. For each element 
of source data on a processor, the value of the data and 
the target processor number are computed. For each 
element of target data on a processor, the source processor num- 
ber and the target memory address are computed. The 
compiler and run time also need to sort out local data 
that do not involve communication, as well as to vec- 
torize the data that are to be communicated. 

We try to optimize the communication processing 
by analyzing the iteration space and data space of the 
array sections involved. Examples of the patterns of 
operations that we optimize include the following: 

■ Contiguous data. When the source or target local 
array section on each processor is in contiguous 
memory addresses, the processing can be optimized 
to treat the section as a whole, instead of comput- 
ing the value or memory address of each element in 
the section. 

In general, array sections belong to this category 
if the last vector dimension is distributed BLOCK 
or CYCLIC and the prior dimensions (if any) are 
all serial. 

If the source and target array sections satisfy even 
more restricted constraints, the processing overhead 
may be further reduced. For example, array opera- 
tions that involve sending a contiguous section of 
BLOCK or CYCLIC distributed data to a single 
processor, or vice versa, belong to this category and 
result in very efficient communication processing. 

■ Unique source or target processor. When a proces- 
sor only sends data to a unique processor, or a pro- 
cessor only receives data from a unique processor, 
the processing can be optimized to use that unique 
processor number instead of computing the proces- 
sor number for each element in the section. This 
optimization also applies to target arrays that are 
fully replicated. 

■ Irregular data access. If all indirection vectors 
are fully replicated for an irregular data access, 
we can actually implement the array operation as 
a full-knowledge communication instead of a more 
expensive partial-knowledge communication. 

For example, the irregular data access statement 

    A(V(:)) = B(:) 

can be turned into a regular remapping statement if 
V is fully replicated and A and B are both distributed. 

Furthermore, if B is also fully replicated, the state- 
ment is recognized as a local assignment, removing 
the communication processing overhead altogether. 
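
Why a replicated indirection vector gives full knowledge can be seen in a short Python sketch. It is a hypothetical illustration, not the compiler's implementation: indices are 0-based, A and B are assumed to be BLOCK-distributed, and `owner` and `local_plan` are names made up for the example. Each processor holds all of V, so it can work out both its sends and its receives without asking anyone.

```python
def owner(index, extent, nprocs):
    """Processor owning a 0-based index under an assumed BLOCK distribution."""
    block = -(-extent // nprocs)  # ceiling division gives the block size
    return index // block


def local_plan(me, V, len_a, len_b, nprocs):
    """Work out, on processor `me` alone, what it must send and receive
    for A(V(:)) = B(:), given a full local copy of V."""
    sends, recvs = [], []
    for i, v in enumerate(V):
        src = owner(i, len_b, nprocs)  # processor holding B(i)
        dst = owner(v, len_a, nprocs)  # processor holding A(V(i))
        if src == me and dst != me:
            sends.append((dst, i, v))
        elif dst == me and src != me:
            recvs.append((src, i, v))
        # src == dst == me is a purely local copy and needs no message.
    return sends, recvs


V = [3, 0, 2, 1]
plans = [local_plan(p, V, len_a=4, len_b=4, nprocs=2) for p in range(2)]
```

Every send that processor 0 plans has a matching receive that processor 1 derived independently, and vice versa, so no handshake is needed to discover the communication pattern.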

Performance 

In this section, we examine the performance of three 
HPF programs. One program applies the shallow- 
water equations, discretized using a finite difference 
scheme to a specific problem; another is a conjugate- 
gradient solver for the Poisson equation, and the 
third is a three-dimensional finite difference solver. 
These programs are not reproduced in this paper, but 
they can be obtained via the World Wide Web at 
http://www.digital.com/info/hpc/f90/. 

The Shallow-water Benchmark 

The shallow-water equations model atmospheric 
flows, tides, river and coastal flows, and other phe- 
nomena. The shallow-water benchmark program uses 
these equations to simulate a specific flow problem. It 
models variables related to the pressure, velocity, and 
vorticity at each point of a two-dimensional mesh that 
is a slice through either the water or the atmosphere. 
Partial differential equations relate the variables. 
The model is implemented using a finite-difference 
method that approximates the partial differential 
equations at each of the mesh points. Models based 
on partial differential equations are at the core of many 
simulations of physical phenomena; finite difference 
methods are commonly used for solving such models 
on computers. 

The shallow-water program is a widely quoted 
benchmark, partly because the program is small 
enough to examine and tune carefully, yet it performs 
real computation representative of many scientific sim- 
ulations. Unlike SPEC and other benchmarks, the 
source for the shallow-water program is not controlled. 

The shallow-water benchmark was written in HPF 
and run in parallel on workstation farms using PSE. 
There is no explicit message-passing code in the pro- 
gram. We modified the Fortran 90 version that 
Applied Parallel Research used for its benchmark data. 
The F90/HPF version of the program takes advantage 
of the new features in Fortran 90 such as modules. 
The Fortran 77 version of the program is an unmodi- 
fied version from Applied Parallel Research. 

The resulting programs were run on two hardware 
configurations: as many as eight 275-megahertz 
(MHz) DEC 3000 Model 900 workstations connected 
by a GIGAswitch system, and an eight-processor 
AlphaServer 8400 (300-MHz) system using shared- 
memory as the messaging medium. Table 1 gives the 
speedups obtained for the 512 × 512 problem, 
with ITMAX set to 50. 

The speedups in each line are relative to the DEC 
Fortran 77 code, compiled with the DEC Fortran 
version 3.6 compiler and run on one processor. The 
DEC Fortran 90 -wsf compiler is the DEC Fortran 90 
version 1.3 compiler with the -wsf option ("parallel- 
ize HPF for a workstation farm") specified. Both 



compilers use version 3.58 of the Fortran RTL. The 
operating system used is Digital UNIX version 3.2. 

Table 1 indicates that this HPF version of shallow 
water scales very well to eight processors. In fact, we are 
getting apparent superlinear speedup in some cases. 
This is due in part to optimizations that the DEC 
Fortran 90 compiler performs that the serial compiler 
does not, and in part to cache effects: with more proces- 
sors, there is more cache. On the shared-memory 
machine, we are getting apparent superlinear speedups 
even when compared to the DEC Fortran 90 -wsf 
compiler's one-processor code; this is likely due to cache 
effects. The program appears to scale well beyond eight 
processors, though we have not made a benchmark- 
quality run on more than eight identical processors. 

For purposes of comparison, Table 2 gives the pub- 
lished speedups from Applied Parallel Research on the 
shallow-water benchmark for the IBM SP2 and Intel 
Paragon parallel architectures. The speedups shown 
are relative to the one-processor version of the code. 
This table indicates that the scaling achieved by the 
DEC Fortran 90 compiler on Alpha workstation farms 
is comparable to that achieved by Applied Parallel 
Research on dedicated parallel systems with high- 
speed parallel interconnects. 

A Conjugate-gradient Poisson Solver 

The Poisson partial differential equation is a work- 
horse of mathematical physics, occurring in problems 
of heat flow and electrostatic or gravitational poten- 
tial. We have investigated a Poisson solver using the 
conjugate-gradient algorithm. The code exercises 
both the nearest-neighbor optimizations and the 
inlining abilities of the DEC Fortran 90 compiler. 25 

Table 3 gives the timings and speedup obtained 
on a 1000 X 1000 array. The hardware and software 
configurations are identical to those used for the 
shallow-water timings. 

Red-black Relaxation 

A common method of solving partial differential 
equations is red-black relaxation. We used this 
method to solve the Poisson equation in a three- 
dimensional cube. We compare the parallelization 
of this algorithm for a distributed-memory system 
(a cluster of Digital Alpha workstations) with Parallel 
Virtual Machine (PVM), which is an explicit message- 
passing library, and with HPF. These algorithms are 
based on codes written by Klose, Wolton, and Lemke 
and made available as part of the suite of GENESIS 
distributed-memory benchmarks. 28 

Table 4 gives the speedups obtained for both 
the HPF and PVM versions of the program, which 
solves a 128 × 128 × 128 problem, on a cluster of 
DEC 3000 Model 900 workstations connected by an 
FDDI/GIGAswitch system. The speedups shown are 
relative to DEC Fortran 77 code written for and run on 
a single processor. This table shows that the HPF ver- 
sion performs somewhat better than the PVM version. 

There is a significant difference in the complexity of 
the programs, however. The PVM code is quite intri- 
cate, because it requires that the user be responsible 
for the block partitioning of the volume, and then for 
explicitly copying boundary faces between processors. 
By contrast, the HPF code is intuitive and far more 
easily maintained. The reader is encouraged to obtain 
the codes (as described above) and compare them. 
In conclusion, we have shown that important algo- 
rithms familiar to the scientific and technical commu- 
nity can be written in HPF. HPF codes scale well to at 
least eight processors on farms of Alpha workstations 
with PSE and deliver speedups competitive with other 
vendors' dedicated parallel architectures. 
Digital's Parallel Software Environment was 
designed to support the development and exe- 
cution of scalable parallel applications on clus- 
ters (farms) of distributed- and shared-memory 
Alpha processors running the Digital UNIX oper- 
ating system. PSE supports the parallel execu- 
tion of High Performance Fortran applications 
with message-passing libraries that meet the 
low-latency and high-bandwidth communica- 
tion requirements of efficient parallel comput- 
ing. It provides system management tools to 
create clusters for distributed parallel process- 
ing and development tools to debug and pro- 
file HPF programs. An extended version of dbx 
allows HPF-distributed arrays to be viewed, 
and a parallel profiler supports both program 
counter and interval sampling. PSE also supplies 
generic facilities required by other parallel lan- 
guages and systems. 



Digital's Parallel Software Environment (PSE) was 
designed to support the development and execution 
of scalable parallel applications on clusters (farms) of 
distributed- and shared-memory Alpha processors 
running the Digital UNIX operating system. PSE 
version 1.0 supports the High Performance Fortran 
(HPF) language; it also supplies generic facilities 
required by other parallel languages and systems. PSE 
provides tools to define a cluster of processors and to 
manage distributed parallel execution. It also contains 
development tools for debugging and profiling paral- 
lel HPF programs. PSE supports optimized message 
passing over multiple interconnect types, including 
fiber distributed data interface (FDDI), asynchronous 
transfer mode (ATM), and shared memory. 

In this paper, we present an overview of PSE version 
1.0 and explain why it was designed and selected 
for use with HPF programs. We then discuss cluster 
definition and management, describe the PSE appli- 
cation model, and discuss PSE's message-passing com- 
munication options, including an optimized transport 
for message passing. We conclude with our perfor- 
mance results. 

Overview of PSE 

Many researchers and computer industry experts 
believe that to achieve cost-effective scalable parallel 
processing, systems must be built using off-the- 
shelf components and not specialized CPUs and 
interconnects. In accordance with this view, we 
have designed Digital's PSE to support the building 
of a consistent yet flexible and easy-to-use parallel- 
processing environment across a networked collection 
of AlphaGeneration workstations, servers, and sym- 
metric multiprocessors (SMPs). Layered on top of the 
Digital UNIX operating system, PSE provides the sys- 
tem software and tools needed to group collections of 
machines for parallel processing and to manage trans- 
parently the distribution and running of parallel appli- 
cations. PSE is implemented as a set of run-time 
libraries and utilities and a daemon process. 

PSE version 1.0 is designed to support clusters con- 
sisting of 1 to 256 machines interconnected with any 
networking fabric that Digital UNIX supports with the 
transmission control protocol/internet protocol 
(TCP/IP). Networking technologies can range from 
simple Ethernet to FDDI, ATM, and MEMORY 
CHANNEL. Parallel execution is most efficient when 
the interconnect technology offers high-bandwidth 
and low-latency communications to the user at the 
process level. When building a cluster for parallel pro- 
cessing, the bisectional bandwidth of the communica- 
tions fabric should scale with the number of processors 
in the cluster. In practice, such a configuration can be 
achieved by building clusters using Alpha processors 
and Digital's GIGAswitch/FDDI as components in a 
multistage switch configuration. Figures 1 and 2 
show two examples of PSE cluster configurations. 
Although the design center for PSE is a set of machines 
connected by a high-speed local area interconnect, a 
cluster can be constructed that includes remote 
machines connected by a wide area network. 

PSE is a collection of many interrelated entities that 
support parallel processing. PSE's model is to collect 
machines (called members) into a set (called a cluster). 
The members are generally all the machines at a site or 
within an organization that have or might have PSE 
installed. One then divides the cluster into named 
subsets (called partitions) that may overlap. The members of a parti- 
tion usually share some common attribute, which 
could be administrative (e.g., the machines of the 
development group), geographic (e.g., connected to 
the same FDDI switch), or relevant to the configura- 
tion (e.g., large memory, SMP). 

The members of a cluster, the partitions, and other 
related data form a configuration database that can be 
maintained in different ways, but preferably by a sys- 
tem administrator. The configuration database can be 
distributed using the Domain Name System (DNS) or 
as a simple file distributed by Network File System 
(NFS). A daemon process, farmd, runs on each mem- 
ber to provide per-member dynamic information, 



such as availability and system load average. The static 
database plus the dynamic information allow applica- 
tions to perform tasks such as load balancing. 

HPF Program Support 

PSE was designed to be largely language-independent; 
it currently supports the HPF programming language. 
HPF allows programmers to express data parallel com- 
putations easily using Fortran 90 array-operation syn- 
tax. As a result, users can obtain the benefits of parallel 
processing without becoming systems programmers 
and developing message passing or threads-based pro- 
grams. The HPF language and compiler are discussed 
elsewhere in this issue of the Digital Technical 
Journal. 

Writing parallel applications in HPF is significantly 
less complex than decomposing a problem and coding 
a solution using explicit message passing, but good 
development tools are required. To allow the viewing 
of HPF distributed arrays, we developed an extended 
version of dbx and a parallel profiler that supports both 
program counter and interval sampling. These tools 
are discussed later in this paper. 

High performance and efficient communication are 
essential to success in parallel processing. PSE includes 
a private message-passing library for use with compiler- 
generated code. Thus it avoids overhead such as buffer 
alignment and size checking that are required with 
user-visible programming interfaces, such as Parallel 
Virtual Machine (PVM). The message-passing library 
supports shared memory and both TCP/IP and user 
datagram protocol (UDP)/IP protocols on many 
types of media, including FDDI and ATM. PSE also 
includes an optional subset implementation of the 
UDP, known as UDP_prime, that has been optimized 
to reduce latency and improve efficiency. This opti- 
mization is discussed later in this paper. 
Before developing PSE for use with HPF programs, 
Digital considered two major alternatives: the distrib- 
uted computing environment (DCE) and PVM. 
(At that time, the message-passing interface [MPI] 
standard effort was in progress.) 

Although a good model for client-server application 
deployment, DCE is designed for use with remote CPU 
resources via procedure calls to libraries. This model 
is very different from the data-parallel and message- 
passing nature of distributed parallel processing. Its 
synchronous procedure call model requires the exten- 
sive use of threads. In addition, DCE involves a signif- 
icant number of setup and management tasks. For 
these reasons, we rejected the DCE environment. 



Three major considerations in our choice to develop 
PSE instead of using PVM were stability, performance, 
and transparency. At the start of the PSE project, the 
publicly available version of PVM did not meet the 
project's goals in any of these three areas. 

Cluster Definition and Management 

PSE is designed to operate in a common system envi- 
ronment where systems are organized so that user 
access, file name space, host names, and so on are con- 
sistent. The ultimate goal for the systems in a distrib- 
uted parallel-processing environment is to approach 
the transparent usability of a symmetric multiproces- 
sor. Facilities such as NFS (to mount/share file systems 
among machines, in particular working directories) 
and network information service (NIS) (also known as 
"yellow pages" and used to share password files) are 
frequently used to set up a common system environ- 
ment. In such an environment, users can log into any 
machine and see the same environment. Other distrib- 
uted environments such as Load Sharing Facility 
(LSF) make this same design assumption. 

A consistent file name space allows all processes that 
make up an application to have the same file system 
view by simply changing directory to the working 
directory of the invoking application. Consistent user 
access allows PSE to use the standard UNIX remote 
shell facility to start up peer processes with standard 
security checking. 

Systems in a common system environment are can- 
didates to become members of a cluster. A cluster is 
often the largest set of machines running PSE and 
sharing a common system environment within an 
organization or site. A cluster is divided into partitions 
that can overlap. A partition consists of a set of 
machines grouped together to meet the needs of an 
application or user. Although partitions may be 
defined in many ways, systems in a partition usually 
share common attributes. 

Partitions 

Parallel programs run most efficiently on a balanced 
hardware configuration. Typically, organizations have 
a varied collection of machines. Over time, organiza- 
tions often acquire new hardware with different net- 
work adapters, faster CPUs, and more memory. Such 
situations can easily lead to increasing difficulty in 
predicting application performance if scheduling 
and load-balancing algorithms treat all machines in 
a cluster equivalently. In addition to hardware differ- 
ences, individual machines can have different software 
installed that affects the ability to run applications. 

The PSE engineering team recognized that the 
number of characteristics that users might want to 
manage for processor allocation and load-balancing 
purposes would be overwhelming. To limit the prob- 
lem, a design was chosen that allows machines to be 
grouped arbitrarily into named partitions. A partition 
can be thought of as a parallel machine. Although 
a system can be a member of two different partitions 
and therefore cause overlap, PSE does not attempt to 
load balance or schedule processes beyond partition 
boundaries. Overlapping partitions can therefore cre- 
ate a complex and potentially conflicting scheduling 
situation. Well-defined and managed partitions allow 
for flexibility and predictability. 

In addition to identifying machine membership, 
partition definition allows various execution-related 



characteristics to be set. Examples include the specifi- 
cation of a default communication type, the default 
execution priority, the upper bound on the execution 
priority, and access control to partition resources. 
Access control is enforced only on PSE-related activity 
and does not affect the use of the machine for other 
applications. 

Configuration Database 

PSE cluster configuration information is captured in 
a database. The database includes a list of cluster mem- 
bers, partitions, and partition members. Additional 
attributes such as the default partition of a cluster, user 
access lists for a partition, and preferred network 
addresses for members of a partition can be encoded in 
the database. 

The PSE configuration database can be distributed 
to all cluster members in two ways: by storing it in 
a file that is accessible from all cluster members, or by 
storing it as a Domain Name System (DNS) database. 
The usage patterns of the cluster database fit well with 
the usage patterns of a DNS database. In particular, 
DNS provides central administrative control with 
version numbering to maintain consistency during 
updates. It is designed for query-often, update-seldom 
usage; it is distributed and allows secondary servers to 
increase availability. Applications linked with the PSE 
run-time libraries transparently access the database to 
obtain configuration information. 

In the DNS database, each PSE configuration 
token-value pair is stored as a DNS TXT record. The 
original specification for DNS did not have TXT 
records, but additional general information was 
attached to domain names at the request of MIT's 
Project Athena. 13 The list of the TXT records, along 
with DNS header information such as version number, 
forms a DNS domain whose name is the PSE cluster 
name. To facilitate the creation and setup of a PSE 
cluster, we built the psedbedit utility for editing and 
maintaining configuration databases. 

A simple file that is available on all members of the 
cluster can also be used as the cluster configuration 
database. The file could be made available through 
NFS or copied to all nodes using rdist. This alternative 
might be appropriate for very simple clusters where 
the services of DNS are not warranted or in cases 
where local policy precludes the use of DNS. 

Dynamic Information and Control 

In addition to the static information of the configura- 
tion database, there are also several pieces of dynamic 
information that optimize usage of clusters and parti- 
tions. At the most fundamental level is availability, i.e., 
is a machine running? Other information includes the 
number of CPUs, load average, number of allowed 
PSE jobs, and number of active PSE jobs. All these 
factors can help an application choose the best set of 
members for parallel execution. This dynamic informa- 
tion is collected by a daemon process (farmd). The 
farmd daemon process executes as a privileged (root) 
process on each cluster member and listens for requests 
on a well-known cluster-specific UDP/IP port. 

Multiple cluster members defined in the configura- 
tion database are designated as load servers. The load 
servers are the central repository for the dynamic 
information for the entire cluster. Their farmd process 
periodically receives time-stamped updates from the 
individual daemons. Applications query the load 
servers for both static and dynamic information. 
Applications do not themselves parse the database nor 
query the individual farmd daemons running on each 
cluster member. 

Once PSE is installed and configured, farmd is 
started each time the system is booted. The name of 
the cluster that farmd will service and the number of 
PSE jobs (job slots) that will be allowed to run are set. 
The inetd facility is used to restart farmd in response to 
UDP/IP connection requests, if farmd is not run- 
ning. Use of the inetd facility to start farmd improves 
the availability of machines to run PSE applications by 
transparently restarting farmd in the case of a failure. 

As farmd daemons are started, they attempt to 
establish TCP/IP connections with their neighbors as 
defined by the PSE configuration database. 14 This 
process is undertaken by all cluster members and 
quickly results in a configuration ring whose purpose 
is the detection of node or network failures. We chose 
a simple ring of TCP/IP connections because the 
mechanism is passive, i.e., it relies on the loss of 
TCP/IP connectivity and does not impose any addi- 
tional load on the system or network under normal 
conditions. When connectivity to a member is lost, 
neighboring cluster members report the member 
being unavailable. This prevents PSE from attempting 
to schedule new applications on the failed member. 

Failures that do not break the configuration ring, but 
prevent updated load information from being sent to 
the load server, are detected by checking the time- 
stamps on previously received load information. As 
soon as a "time-to-live" period expires for a particular 
member's load information, the load servers disable fur- 
ther use of the suspect node. System managers are also 
able to set the number of job slots to zero at any time, 
thus disabling the host for new PSE-related activities. 
This has no effect on currently executing applications. 
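
The two conditions that take a member out of service for new work, an expired load report and a job-slot count of zero, can be written as a short filter. The sketch below is hypothetical and is not taken from farmd: the function name, the dictionary layout, and the use of plain numbers for time stamps are assumptions made for illustration.

```python
def schedulable_members(last_update, job_slots, now, time_to_live):
    """Return the members a load server would still offer to new applications.

    last_update maps member name -> time stamp of its latest load report;
    job_slots maps member name -> number of PSE jobs it is allowed to run.
    """
    usable = []
    for member, stamp in last_update.items():
        stale = (now - stamp) > time_to_live    # load report has expired
        closed = job_slots.get(member, 0) == 0  # manager set the slots to zero
        if not stale and not closed:
            usable.append(member)
    return sorted(usable)
```

For example, with reports stamped 100, 40, and 95 for members a, b, and c, a current time of 105, a time-to-live of 30, and c's job slots set to zero, only member a remains schedulable.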

Pseudo-gang Scheduling 

The start-up sequence for a PSE application includes 
the potential modification of execution priority and 
scheduling policy. These changes are made in accor- 
dance with the user command-line options and/or the 
default characteristics defined by the PSE configura- 
tion database. To allow nonroot UID processes to 
elevate scheduling priorities and/or use alternate sched- 
uling policies, farmd modifies the user process's 
scheduling priority or policy. Processes scheduled at 
a high real-time priority using a first in, first out 
(FIFO) queue with preemption policy achieve a 
pseudo-gang-scheduling effect. (Gang scheduling 
ensures that all processes associated with a job are 
scheduled simultaneously.) This effect occurs because 
of the scheduling preference given high-priority jobs 
and because PSE polls for messages for a period of 
time before giving up the CPU. 

Using PSE 

Parallel applications are developed for PSE using the 
Digital Fortran 90 compiler. When the Fortran 90 
compiler is invoked with the -wsf N flag, HPF source 
codes are compiled and then linked with a PSE library 
for parallel execution on N processors. After defining a 
partition in which to run, a PSE application can be run 
simply by typing the name of the application. The fol- 
lowing example shows the compilation and execution 
of a four-process program called myprog on a set of 
cluster members in the partition named fast. 

    csh> setenv PSE_PARTITION fast 
    csh> f90 -wsf 4 myprog.f90 -o myprog 
    csh> myprog > myprog.out < myprog.dat & 

Transparently, PSE starts up four processes on 
members of the partition fast; creates communications 
channels between the processes; supports redirected 
standard input, output, and error (standard I/O); and 
controls the execution and termination of the applica- 
tion. Several environment variables and run-time flags 
are available to control how an application executes. 
Figure 3 shows how to use PSE. 

PSE Application Model 

PSE implements an application as a collection of inter- 
connected processes. The initial process created when a 
user runs an application is called the controlling process. 
It provides application distribution and start-up services 
and preserves UNIX user-interface semantics (i.e., stan- 
dard I/O), but does not participate in the HPF parallel 
computation. The controlling process usually deter- 
mines which partition members to use for the parallel 
computation by getting system load information from 
a load server and then distributing the new processes 
across the partition. As an alternative, users can direct 
computation onto specific partition members. 

The controlling process starts a process called the 
io_manager on each partition member participat- 
ing in the parallel execution. Each io_manager then 
starts one or more application peer processes that 
perform the user-specified computation. The use of 
an io_manager is necessary to create a parent-child 
process relationship between the io_manager and peer 
processes. This relationship is used for exit status report- 
ing and process control. It also enables or eases other 
activities, such as signal handling and propagation. Peer 
processes create communication channels between 
themselves and perform standard I/O through a desig- 
nated peer. Standard I/O is forwarded to and from the 
controlling process through the io_manager. Figure 4 
shows a PSE application structure. 

Application Initialization 

Prior to the execution of any user code, an initializa- 
tion routine executes automatically through function- 
ality provided by the linker and loader. The 
initialization routine implements both the controlling 
process functions and the HPF-specific peer initializa- 
tion. Because no explicit call is required, parallel HPF 
procedures can be used within non-HPF main pro- 
grams, and proper initialization will occur. A simple 
HPF main program can also be used with PSE to start 
up and manage a task-parallel application that uses 
PVM or MPI for message passing. 

In general, the controlling process places peer 
processes onto members of a partition, although hand 
placement of individual peers onto selected members 
is possible. To achieve efficiency and fairness in map- 
ping a set of peers, the controlling process consults 
with a load server for load-balancing information. 
Which members are used and the order in which they 
are used is based on each member's load average, 
number of CPUs, and number of available job slots. 
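
The text names the inputs to placement (load average, number of CPUs, and available job slots) but not the rule that combines them. The following is a minimal sketch of one plausible ordering, assuming members are ranked by load per CPU and that members without a free job slot are skipped. The Member fields and the rank_members function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Member:
    """One partition member as a load server might report it (hypothetical fields)."""
    name: str
    load_average: float
    cpus: int
    free_job_slots: int


def rank_members(members):
    """Return usable members, least loaded per CPU first.

    Assumption: members without a free job slot are skipped, and ties are
    broken in favor of the member with more CPUs. The article does not give
    the actual ordering rule.
    """
    usable = [m for m in members if m.free_job_slots > 0]
    return sorted(usable, key=lambda m: (m.load_average / m.cpus, -m.cpus, m.name))


partition = [
    Member("alpha1", load_average=2.0, cpus=4, free_job_slots=2),
    Member("alpha2", load_average=0.4, cpus=1, free_job_slots=1),
    Member("alpha3", load_average=0.1, cpus=1, free_job_slots=0),
    Member("alpha4", load_average=1.0, cpus=4, free_job_slots=3),
]
order = [m.name for m in rank_members(partition)]
```

In this example alpha3 is idle but has no job slot left, so it is never chosen; alpha4 carries the lowest load per CPU and is used first.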

As an alternative, PSE may map peer processes onto 
members based upon a user-selected mode of opera- 
tion. In the default physical mode of operation, PSE 
maps one peer process per member. In virtual mode, 
PSE allows more than one peer process per member, 
thereby enabling large virtual clusters. This is useful 
for developing and debugging parallel programs on 
limited resources. Virtual clusters also improve appli- 
cation availability: when the requested number of peer 
processes is greater than the available set of partition 
members, applications continue to run; however, they 
may suffer performance degradation. 
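
The two modes can be sketched as follows. This is an illustration only: the round-robin wrap in virtual mode and the error raised in physical mode when members run out are assumptions, since the text says only that virtual mode permits more than one peer per member. The function name map_peers is hypothetical.

```python
def map_peers(members, n_peers, mode="physical"):
    """Assign peer numbers 0..n_peers-1 to partition members.

    physical: at most one peer per member.
    virtual:  peers wrap around the member list, so a member may host several.
    """
    if not members:
        raise ValueError("partition has no members")
    if mode == "physical":
        if n_peers > len(members):
            # Assumed behavior; the article does not say what PSE does here.
            raise ValueError("physical mode needs one member per peer")
        return {peer: members[peer] for peer in range(n_peers)}
    if mode == "virtual":
        return {peer: members[peer % len(members)] for peer in range(n_peers)}
    raise ValueError(f"unknown mode: {mode}")


# Four peers on three members, as in a small virtual cluster.
placement = map_peers(["m0", "m1", "m2"], 4, mode="virtual")
```

Here peer 3 shares member m0 with peer 0, which keeps the application running at the cost of contention on that member.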

Application Peer Execution 

Each application peer process has an io_manager 
parent process that provides it with environment 
initialization, exit value processing, I/O buffering, 
signal forwarding, and potential scheduling priority 
and policy modification. Rather than include the 
io_manager functions in each PSE executable, 
the io_manager is implemented as a simple utility. 

Application peers run the same binary image as the 
controlling process. They inherit their current working 
directory, resource usage limits, and an augmented set 
of environment variables from their controlling process 
through their parent io_manager. When started, the 
initialization process described for the controlling 
process is repeated, but peers do not become control- 
ling processes because they detect that a controlling 
process already exists. Instead, peer processes return 
from the initialization routines with communication 
links established and are ready to run user-application 
code. Figure 5 represents a controlling process, four 
application peers running on three members, and the 
communications between processes. 



Application Exit 

Multiple peer exits can have potentially conflicting exit 
values. Coordinating them into a single meaningful 
application exit value is the most challenging trans- 
parency issue faced by PSE. Under normal circum- 
stances, all peer processes exit without error and at 
approximately the same time. The resulting exit values 
are reported to the application controlling process by 
the io_managers. The application (i.e., the controlling 
process) is allowed to exit without error only when all 
exit values are recorded and standard I/O connections 
are drained and closed. The HPF compiler generates 
synchronization code to guarantee the roughly syn- 
chronous exit for all nonerror conditions. This pre- 
sumption allows PSE to implement a timely exit 
model, i.e., one by which we can reasonably assume 
normal activity will cease after receiving the last exit 
notification from an io_manager. 

Peers that exit abnormally make it difficult to 
provide a meaningful exit value for the application. 
Consider one peer process that exits due to a segmen- 
tation fault and another that exits due to a floating- 
point exception. There is no single exit value possible 
for the application; PSE chooses the first abnormal 
value it sees. Furthermore, as a result of error detec- 
tion in the communication library, the other peer 
processes will exit with lost network connections. It is 
possible that the controlling process will see an exit 
value for this effect before it sees an exit value for one 
of the causes, resulting in a misleading application exit 
value. To understand a faulting parallel application 
running under PSE, the core files associated with each 
peer process must be examined. 
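
The "first abnormal value seen" rule, and the way it can mislead, can be sketched in a few lines. This is an illustration, not PSE code: the function name is hypothetical, zero is assumed to mean a normal exit, and the numeric values are illustrative shell-style codes (139 and 136 for a segmentation fault and a floating-point exception, 1 for a lost connection).

```python
def application_exit_value(reports):
    """Fold per-peer exit values, in arrival order, into one application value.

    `reports` is an iterable of (peer_number, exit_value) pairs in the order
    in which the controlling process receives them.
    """
    for _peer, value in reports:
        if value != 0:
            return value  # the first abnormal value seen wins
    return 0


# Peer 2 only lost its connection because peer 0 faulted, but its report
# happens to arrive first, so the effect hides the cause.
misleading = application_exit_value([(2, 1), (0, 139), (1, 136)])
clean = application_exit_value([(0, 0), (1, 0), (2, 0)])
```

Because arrival order decides the result, the single exit value cannot be relied on to identify the original fault.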

PSE includes support for capturing the entire appli- 
cation core state and for discriminating the multiple 
core files of a parallel application. Because peer pro- 
cesses share the same working directory, any core files 
generated would be inconsistent and overwrite one 
another due to N processes writing to the same core 
file name. PSE solves this problem by establish- 
ing a signal handler that catches core-generating sig- 
nals, creates a peer-specific subdirectory, changes to 
the new directory, and resignals the signal to cause the 
writing of the core file. The root for the core directo- 
ries can be set through an environment variable. 

Issues 

Although PSE achieves the standard UNIX look-and- 
feel for most application situations, complete trans- 
parency is not achieved. For example, timing an 
application-controlling process using the C shell's 
built-in time command does not time user code or 
provide meaningful statistics other than the elapsed 
wall clock time to start a parallel application and to tear 
it down. Another situation that highlights the parallel 
nature of PSE occurs during application debugging: 
multiple debug sessions are started by running the 
application with a debugger flag rather than by using 
dbx directly. 

Tools for HPF Programming 

The development model for HPF-based applications 
is a two-step process. First, a serial Fortran 90 program 
is written, debugged, and optimized. Then it is paral- 
lelized with HPF directives and again debugged and 
optimized. The development tools supplied with PSE 
address profiling and debugging. Unlike most of PSE, 
which is language-independent, both the pprof profil- 
ing facility and the "dbx in n windows" debugging 
facility are specific to HPF programming. 



Profiling 

Several issues in profiling parallel HPF programs do 
not apply to Fortran programs that execute serially. 
HPF execution occurs through multiple processes on 
multiple processors simultaneously and therefore pro- 
duces multiple profiling data sets. The storage and 
analysis of these data sets must be coordinated to pro- 
duce accurate and comprehensive program profiles. 
Unlike typical Fortran programs, significant time can 
be spent communicating in an HPF program. The 
Digital UNIX prof and pixie utilities do not handle 
either of these issues. 15 In addition, the prof utility has 
coarse-grained (1-millisecond resolution) program 
counter (PC) sampling and reports only down to the 
procedure level. To address these issues, Digital added 
profiling support to the Fortran 90 compiler and 
developed the pprof analysis tool. 

Data Collecting The PSE parallel profiling facility 
handles profiling data collection in parallel by writing 
data to a set of files that are uniquely named. Each file 
name encodes the application name, the type of data collec- 
tion, and the peer number of the process. The analysis 
tool pprof merges the data in the file set when per- 
forming analysis and producing reports. 

The profiling facility supports two types of data collecting: nonin- 
trusive traditional PC sampling and intrusive interval 
profiling. PC sampling simply records the program 
counter at each occurrence of the system clock inter- 
val interrupt. To achieve an accurate execution profile 
with PC sampling, programs must be long running 
to become statistically significant. Also, it is difficult to 
gather do-loop iteration data using PC sampling. 

We developed interval profiling support to overcome 
the deficiencies of PC sampling. Interval profiling is 
achieved with compiler-inserted functions that record 
the entry and exit times for the execution of each event. 
This produces an accurate execution profile. Events 
include routines, array assignments, do loops, FORALL 
constructs, message sends, and message receives. 
Because the entry and exit times are recorded, time 
spent executing other events within an event is 
included, which gives a hierarchical profile. To achieve 
fine-resolution timings (single-digit nanoseconds), the 
Alpha process cycle counter is used to measure time. 16 
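
The idea of interval profiling can be sketched as follows. This is an illustration under stated assumptions, not the compiler-inserted code: the IntervalProfiler class and its event method are hypothetical, and an injectable clock stands in for the Alpha cycle counter. Each event records its entry and exit time, and because nested events run inside the outer interval, the outer event's time includes theirs, which yields the hierarchical profile.

```python
import time
from collections import defaultdict
from contextlib import contextmanager


class IntervalProfiler:
    """Record inclusive time per event, keyed by the path of enclosing events."""

    def __init__(self, clock=time.perf_counter_ns):
        self.clock = clock
        self.stack = []                     # names of currently open events
        self.inclusive = defaultdict(int)   # event path -> accumulated ticks
        self.calls = defaultdict(int)       # event path -> number of executions

    @contextmanager
    def event(self, name):
        self.stack.append(name)
        path = tuple(self.stack)
        start = self.clock()                # entry time
        try:
            yield
        finally:
            self.inclusive[path] += self.clock() - start   # exit time - entry time
            self.calls[path] += 1
            self.stack.pop()


# A fake clock that advances 10 ticks per reading makes the example deterministic.
ticks = iter(range(0, 1000, 10))
prof = IntervalProfiler(clock=lambda: next(ticks))
with prof.event("solve"):
    with prof.event("do_loop"):
        pass
    with prof.event("send"):
        pass
```

With this fake clock, "solve" accumulates 50 ticks, of which 10 belong to the nested loop and 10 to the nested send.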

Analysis The pprof utility provides many different 
ways to examine and report on a large set of profiling 
data from a parallel program execution. Different 
approaches include focusing on routines, statements, 
or communications. In contrast, prof reports on proce- 
dures only. With pprof, the scope of the analysis can be 
limited to a single peer process or encompass all appli- 
cation processes. The range of reports generated can be 
comprehensive or limited to a number of events or 
a percentage of time. Users can specify their reports 
from a combination of analysis, report format, and 
scoping options. By default, the pprof utility reports on 
routine-level activity averaged across all peer processes, 
which provides an overall view of application behavior. 

Parallel programs execute most efficiently when 
there is minimum communication between processes. 
The high-level, data parallel nature of the HPF 
language reduces the visibility of communication to 
the programmer. To make tuning easier, pprof was 
designed with the ability to focus tuning on communi- 
cation. Reports can be generated that help correlate 
the use of HPF data-distribution directives to 
observed communication activities. 

Debugging 

For PSE version 1.0, we are supplying a "dbx in n win- 
dows" capability. Each peer is controlled by a separate 
instance of dbx that has its own Xterm window. This 
capability gives users basic debugging functionality, 
including the ability to set breakpoints, get backtraces, 
and examine variables on an all-peer or a per-peer 
basis. We added a new command to dbx, hpfget, that 
allows the viewing of individual elements of a distrib- 
uted array. We recognize it as far from meeting the 
challenges of an HPF debugger, and we are continuing 
the development of a new debugging technology. 

Message-passing Model 

One of the goals of PSE is to support high-performance, 
reliable message passing for parallel applications. At 
the start of the project, the HPF language and com- 
piler technology were still in their infancy. Even 
though no HPF application code base existed, the PSE 
team needed to determine the message-passing 
requirements. To support message passing success- 
fully, PSE had to be flexible enough to accommodate 
new interconnect technologies and network proto- 
cols, adapt to the message-passing characteristics of 
future HPF applications, and support the changing 
demands of the compiler. A need for high perfor- 
mance and efficiency with low latency was assumed. 

The PSE message-passing facility provides primi- 
tives to initialize and terminate message-passing oper- 
ations, to allocate and deallocate message buffers, and 
to send and receive messages. A PSE message contains 
a tag, a source peer number, and variable-length data. 
The higher layers fill in the tag, which is used as a mes- 
sage identifier on receive. The data is a stream of bytes 
without any data-type information. These primitives 
are not intended to be used in the application code. 
The HPF compiler implicitly generates calls to these 
primitives. Because the message-passing primitives are 
tightly coupled to the HPF compiler, overhead such as 
data-alignment restrictions and error checking can be 
eliminated. 



The PSE message-passing model assumes that the 
application peers are running on systems with the same 
CPU architecture and networking capabilities. Each 
peer process can send or receive binary messages 
directly to or from any other peer. This is different from 
the PVM model, where messages might be routed to 
a pvmd daemon to be multiplexed to another peer, or 
messages might be converted to external data represen- 
tation (XDR) to allow for data passing between 
machines with different architectures. 17 

Buffer allocation and deallocation routines are spe- 
cific to each of the communication options that PSE 
supports. (These options are discussed in the follow- 
ing sections.) Before a message can be sent, a buffer 
must be allocated. The send primitive sends the mes- 
sage and implicitly deallocates the buffer. The receive 
primitive implicitly allocates a buffer containing the 
newly arrived message. Receive buffers have to be 
deallocated explicitly after they are used. Our initial 
design allowed a received message buffer to be reused 
for sending a new message, possibly to a different peer. 
This design was inefficient, especially when a commu- 
nication option such as shared memory optimizes 
buffer allocation on a peer-by-peer basis. The current 
design uses a peer number as a parameter to the buffer 
allocation routine and does not allow reuse of the 
received message buffer. 

The send primitive sends a message contained in 
a preallocated buffer to a specified peer. It guarantees 
reliable in-order delivery of messages. For underlying 
protocols such as UDP/IP that do not provide this 
level of service, the message-passing library must pro- 
vide it. A broadcast primitive is also provided to send 
a single message to all peers. 

The receive primitive uses a particular message tag 
to receive a message with a matching tag from any 
peer. This allows the compiler to use functions that can 
perform calculations correctly when data is required 
from several peers, regardless of the order in which 
messages arrive. The normal operation for receive is 
to block the receiving peer until a matching tagged 
message arrives. A nonblocking receive is also pro- 
vided to poll for messages. 
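
Tag-matched receive can be sketched as a queue that is searched by tag rather than by sender. This is an illustration only: Message and Mailbox are hypothetical names, and only the nonblocking (polling) form is shown. A blocking receive would wait until the search succeeds.

```python
from collections import deque, namedtuple

# A message carries a tag, a source peer number, and untyped data.
Message = namedtuple("Message", "tag source data")


class Mailbox:
    """Hold arrived messages until a receive asks for their tag."""

    def __init__(self):
        self.pending = deque()

    def deliver(self, msg):
        self.pending.append(msg)

    def receive_nonblocking(self, tag):
        """Return the oldest message with a matching tag from any peer, else None."""
        for i, msg in enumerate(self.pending):
            if msg.tag == tag:
                del self.pending[i]
                return msg
        return None


box = Mailbox()
box.deliver(Message(tag=7, source=2, data=b"\x03"))
box.deliver(Message(tag=5, source=1, data=b"\xff"))
box.deliver(Message(tag=7, source=0, data=b"\x04"))

# A reduction over tag 7 gives the same sum whichever peer's message came first.
total = 0
sources = []
while (msg := box.receive_nonblocking(7)) is not None:
    total += msg.data[0]
    sources.append(msg.source)
```

The message with tag 5 is left in the mailbox for a later receive, which is what lets generated code ask for data by purpose rather than by arrival order.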

Communication Options 

PSE provides applications with several run-time selec- 
table communication options. Within a single SMP 
system, PSE supports message passing over shared 
memory. On multiple system configurations, PSE sup- 
ports network message passing using the TCP/IP or 
UDP/IP protocols over any network media that the 
Digital UNIX operating system supports. Currently, 
PSE supports a single communication option within 
an application execution, but the design supports 
multiple protocols and interconnects. Run-time selec- 
tion of the communication options and media, which 
is implemented using a vector of pointers to functions 
within a shared library, provides flexibility to introduce 
new protocols and media without having to recompile 
or relink existing applications. 

Shared-memory Message Passing 

The use of shared memory as a message-passing 
medium allows for very high performance because 
data does not have to be copied. When designing 
shared-memory messaging, we looked at a variety of 
interrelated issues, including coordination mecha- 
nisms, memory-sharing strategies, and memory con- 
sumption. The use of locks (i.e., semaphores) in the 
traditional manner to coordinate access to shared- 
memory segments proved problematic. For example, 
clients often request a message from any peer, not 
from a particular peer. This implies the use of a general 
receive semaphore that senders would unlock after 
delivering data. Contention for a single lock could be 
significant and could become a performance bottle- 
neck. Instead of locks, a simple set of producer and 
consumer indexes is used to manage a ring buffer of 
messages. Senders read the consumer index and 
update the producer index, and receivers read the pro- 
ducer index and update the consumer index to syn- 
chronize. No locking is required. 
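
The index discipline can be sketched for one sender and one receiver as follows. This is a single-process illustration, not the PSE shared-memory code: the Ring class is hypothetical, and a real shared-memory version would also need the slot write to become visible before the producer index update (for example, with a memory barrier), which Python does not model.

```python
class Ring:
    """Single-producer, single-consumer ring managed by two indexes.

    Each index has exactly one writer: the sender advances `producer`,
    the receiver advances `consumer`. Each side only reads the other's index.
    """

    def __init__(self, slots):
        self.slots = [None] * slots
        self.producer = 0   # next slot to fill; written only by the sender
        self.consumer = 0   # next slot to drain; written only by the receiver

    def try_send(self, msg):
        nxt = (self.producer + 1) % len(self.slots)
        if nxt == self.consumer:      # full: one slot is always left empty
            return False
        self.slots[self.producer] = msg
        self.producer = nxt           # publish only after the slot is filled
        return True

    def try_receive(self):
        if self.consumer == self.producer:
            return None               # empty
        msg = self.slots[self.consumer]
        self.consumer = (self.consumer + 1) % len(self.slots)
        return msg


ring = Ring(4)
accepted = [ring.try_send(m) for m in (b"a", b"b", b"c", b"d")]
drained = [ring.try_receive() for _ in range(4)]
```

Because neither side ever writes the other's index, no lock is needed; the cost is one unused slot, which distinguishes a full ring from an empty one.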

Several memory-sharing strategies are possible: all 
peers may share a single large segment, each pair of 
peers may share a segment, or each pair of peers may 
have a pair of unidirectional segments. The use of unidi- 
rectional pairs of shared-memory segments offers sev- 
eral advantages: it simplifies the code by eliminating 
multiplexing; it fits in well with the design of MEMORY 
CHANNEL hardware, which is unidirectional; and by 
creating receive segments with read-only protection, it 
promotes robustness. 18 A disadvantage to the use of 
unidirectional segment pairs is increased memory use 
due to limited sharing. Because of its advantages and 
because the coordination of the producer/consumer 
index does not require segments to be shared between 
peers, we selected unidirectional pairs of shared- 
memory segments as our memory-sharing strategy. 

To enhance performance, a receiver spins, waiting 
for a peer to produce a message. If there is no data 
after a number of spin iterations, the receiver voluntar- 
ily deschedules itself. The number of spin iterations 
was chosen to be small enough to be polite, but large 
enough to permit scheduling when a peer produced 
a message. An additional performance enhancement 
allows the user, via command line option, to prevent 
peers from migrating between processors, which 
results in better cache utilization. 

TCP/IP Message Passing 

TCP/IP is the default communication option. It pro- 
vides full wire bandwidth for peer-to-peer communi- 
cation with large message transfer sizes across a variety 
of network media. The implementation of the message- 
passing primitive operations is relatively straight- 
forward since TCP/IP provides reliable, in-order, 
connection-oriented delivery of messages. The 
TCP/IP initialization routine sets up a vector of bound and 
connected socket descriptors, one for each peer. These 
sockets are used to send messages to other peers. The 
receive primitive uses a blocking select() system call on 
all sockets. Because TCP/IP is connection based, 
abnormal peer termination and network faults can be 
detected by connection loss. 
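
A receive built on select() over one socket per peer can be sketched as follows. This is an illustration, not the PSE library: the 8-byte header carrying a tag and a payload length is a hypothetical wire format, and the function names are invented for the example. A connected socket pair stands in for a TCP connection between two peers.

```python
import select
import socket
import struct

HEADER = struct.Struct("!II")   # tag, payload length (hypothetical framing)


def send_message(sock, tag, payload):
    sock.sendall(HEADER.pack(tag, len(payload)) + payload)


def recv_exact(sock, n):
    """Read exactly n bytes from a stream socket, or fail if the peer is gone."""
    chunks = []
    while n:
        chunk = sock.recv(n)
        if not chunk:
            raise ConnectionError("peer closed the connection")
        chunks.append(chunk)
        n -= len(chunk)
    return b"".join(chunks)


def receive_any(socks):
    """Block in select() until any peer socket is readable, then read one message.

    `socks` maps peer number -> connected socket. Returns (peer, tag, payload).
    """
    readable, _, _ = select.select(list(socks.values()), [], [])
    sock = readable[0]
    tag, length = HEADER.unpack(recv_exact(sock, HEADER.size))
    peer = next(p for p, s in socks.items() if s is sock)
    return peer, tag, recv_exact(sock, length)


local, remote = socket.socketpair()
send_message(remote, 9, b"hello")
first = receive_any({1: local})
```

If the remote end closes its socket, the next receive_any call raises ConnectionError, which mirrors how connection loss reveals abnormal peer termination.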

Although TCP/IP provides acceptable bandwidth, 
latency-sensitive applications might suffer from the 
processing overhead of the TCP/IP protocol. The 
connection-oriented nature of TCP/IP also requires 
the application to maintain many socket descriptors, 
which reduces scalability and necessitates the use of 
expensive select() system calls on receive. 

UDP/IP Message Passing 

To address the latency and overhead of TCP/IP, PSE 
provides UDP/IP as an option that can be selected at 
run time. UDP/IP is a connectionless protocol that 
provides unordered, best-effort delivery of messages. 
Because UDP/IP is connectionless, the initialization 
function needs to set up a single locally bound socket 
descriptor for all peer-to-peer communication. File 
descriptor use is not a scaling issue when UDP/IP 
is used for messaging. 

Reliable in-order delivery of messages is imple- 
mented at the library level. Each peer maintains a set of 
send and receive ring buffers, one for each peer. The 
ring buffers have producer and consumer indexes 
to indicate positions in the ring where messages can 
be read or written. The buffer-allocation primitive 
allocates buffers from the send ring whenever possible, 
or from a pool of overflow buffers when the ring is full. 
The use of an overflow buffer eliminates the need for 
upper levels to provide flow control or to block sends. 
The send and receive primitives manipulate the pro- 
ducer and consumer indexes of the send and receive 
rings. In-order delivery of messages is guaranteed 
through the use of a sliding window protocol with 
sequentially numbered messages. For efficiency, piggy- 
backed acknowledgments are used. 
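
The receiving half of a sliding window protocol can be sketched as follows. This is an illustration under simplifying assumptions, not the PSE protocol engine: the InOrderReceiver class is hypothetical, sequence numbers are unbounded integers (no wraparound), and the returned acknowledgment is the next sequence number wanted, which a real implementation would piggyback on outgoing data.

```python
class InOrderReceiver:
    """Reorder numbered packets from one peer and release them in sequence."""

    def __init__(self, window):
        self.window = window    # how far ahead of `expected` a packet may be
        self.expected = 0       # next sequence number to deliver
        self.held = {}          # out-of-order packets waiting for a gap to fill

    def on_packet(self, seq, data):
        """Accept one packet; return (data deliverable now, cumulative ack)."""
        if self.expected <= seq < self.expected + self.window:
            self.held.setdefault(seq, data)   # duplicates are ignored
        delivered = []
        while self.expected in self.held:
            delivered.append(self.held.pop(self.expected))
            self.expected += 1
        return delivered, self.expected


rx = InOrderReceiver(window=4)
step1 = rx.on_packet(1, "b")    # arrives early, held back
step2 = rx.on_packet(0, "a")    # fills the gap, releases both
step3 = rx.on_packet(0, "a")    # retransmitted duplicate, dropped
step4 = rx.on_packet(9, "z")    # outside the window, dropped
```

The sender keeps unacknowledged packets and retransmits them until the cumulative acknowledgment passes their sequence number.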

To improve scheduling synchronization among 
multiple peers, especially when a high-priority FIFO 
scheduling policy is used, the UDP/IP option uses a 
nonblocking socket. On receive, it loops calling the 
recvfrom() system call many times before calling the 
expensive select() system call to wait for a message to 
arrive. Abnormal peer termination and network faults 
cannot be detected since the socket layer does not 
maintain a connection state. The UDP/IP option con- 
tains a user-specifiable time-out value by which the peer 
application will exit when there is no socket activity. 

The UDP/IP option provides better bandwidth 
than TCP/IP with smaller messages and matches 
the TCP/IP bandwidth at large message sizes. The 
user-level latency reduction, however, was less than 
expected. The next two sections discuss our investiga- 
tion into ways to optimize the latency of UDP/IP and 
the performance of the message-passing options. 

Optimizing UDP/IP 

Our initial approach to improve latency was to reex- 
amine the standard UDP/IP code path within the 
Digital UNIX kernel for unnecessary overhead. Our 
idea was to create a faster path, optimized for a 
UDP/IP over a local area network (LAN) configura- 
tion by reducing numerous conditional checks in the 
path. Although this work yielded some improvement, 
it was not enough to justify supporting a deviation 
from the standard code path. An overhaul of the origi- 
nal code path would have been necessary for this 
approach to gain significant improvement in latency. 

UDP/IP provides a general transport protocol, 
capable of running across a range of network inter- 
faces. We realize the value in retaining the generality 
of UDP/IP. For optimal performance, however, we 
anticipate typical cluster configurations being con- 
structed using a high-performance switched LAN 
technology such as the GIGAswitch/FDDI system. 
In such configurations, the IP family of protocols 
presents unnecessary protocol-processing overhead. 
A messaging system using a lower-level protocol, such 
as native FDDI, would offer better latency, but its 
implementation requires the use of nonstandard mech- 
anisms to access the data link layer directly, which is less 
general and portable than a UDP/IP implementation. 

Based on the above observations, we designed a new 
protocol stack in the kernel, called UDP_prime, to 
coexist with the standard UDP/IP stack. UDP_prime 
packets conform to the UDP/IP specification. 19 To 
reduce the amount of per-packet processing and 
approach that of a lower-level protocol, UDP_prime 
imposes several restrictions on its use. These restric- 
tions optimize the typical switched LAN cluster config- 
urations. To retain the generality of UDP/IP, 
UDP_prime falls back to the standard UDP/IP stack 
when these restrictions are not applicable. 

Restrictions on UDP_prime 

The LAN nature of the cluster configuration imposes 
a restriction on UDP_prime. Each cluster member has 
to be within the same IP subnet, which is directly 
accessible from any other member. With this restric- 
tion, routing decisions and Internet-to-hardware 
address resolution can be done once for each peer- 
to-peer connection rather than on a per-packet basis. 
Per-packet UDP/IP checksum processing can also 
be eliminated, because intermediate routing is not 
involved and the data link cyclic redundancy check 
(CRC) is sufficient to guarantee error-free packets. 

The next restriction is the maximum length of the 
message. PSE message passing uses fixed-size buffers. 
UDP_prime restricts the maximum buffer size to be 
the maximum transmission unit (MTU) of the underly- 
ing network interface. This eliminates per-message IP 
fragmentation and defragmentation overhead. Since 
the messaging clients have to fragment the messages 
into fixed-size buffers at the higher layer, there is no 
need for the IP layer to perform further fragmentation. 

One complication in our current implementation 
occurs when multiple peers are running on a single 
system while others are on remote systems. The 
default behavior for peers within a single system is 
to communicate across the loopback interface. In this 
situation, there are two MTU values, one for the net- 
work interface and one for the loopback interface. 
Our current implementation of UDP_prime does not 
allow communication over the loopback interface so 
that a single-size MTU can be used. Further studies 
need to be done to find an optimal maximum buffer 
size, taking into account multiple MTU values, page 
alignment, and so forth. 

Based on the above restrictions, UDP_prime opti- 
mizes the per-packet processing overhead of sending a 
packet by constructing a UDP, IP, and data link packet 
header template for each peer at initialization. Except 
for a few fields, the content of these headers is static 
with respect to a particular peer. UDP_prime defines a 
new IP option, IP_UDP_PRIME, for the setsockopt() 
system call, to allow the messaging system to define 
the set of peers and their Internet addresses involved in 
the application execution. 20 The IP option processing, 
done prior to sending any message, makes routing 
decisions, performs Internet-to-hardware address res- 
olution, and fills in the static portion of the header 
fields. When sending a packet, UDP_prime simply 
copies the header template to the beginning of the 
packet, minimizing the per-packet processing over- 
head and increasing the likelihood of the templates 
being in the CPU cache. Several header fields, such as 
the IP identification, header checksum, and packet 
length fields, are then filled dynamically, and the com- 
plete packet is presented to the interface layer. 
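
The header-template technique can be sketched in user space as follows. This is an illustration, not the kernel code: the function names are hypothetical, the data link header is omitted, the TOS value 0x10 is a placeholder for the priority marking, and the UDP checksum is left at zero (meaning "not computed" in UDP over IPv4), in line with the checksum elimination described earlier.

```python
import socket
import struct

IP_HEADER = struct.Struct("!BBHHHBBH4s4s")   # 20-byte IPv4 header, no options
UDP_HEADER = struct.Struct("!HHHH")          # 8-byte UDP header


def ip_checksum(header):
    """Ones-complement sum of 16-bit words, as used for the IPv4 header."""
    total = sum(struct.unpack(f"!{len(header) // 2}H", header))
    while total >> 16:
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


def make_template(src_ip, dst_ip, src_port, dst_port, tos=0x10):
    """Build the static header fields once per peer, at initialization."""
    ip = IP_HEADER.pack(0x45, tos, 0, 0, 0, 64, socket.IPPROTO_UDP, 0,
                        socket.inet_aton(src_ip), socket.inet_aton(dst_ip))
    udp = UDP_HEADER.pack(src_port, dst_port, 0, 0)
    return bytearray(ip + udp)


def build_packet(template, ident, payload):
    """Copy the template, then fill only the per-packet fields."""
    pkt = bytearray(template)
    total_len = IP_HEADER.size + UDP_HEADER.size + len(payload)
    struct.pack_into("!H", pkt, 2, total_len)                        # IP total length
    struct.pack_into("!H", pkt, 4, ident)                            # IP identification
    struct.pack_into("!H", pkt, 24, UDP_HEADER.size + len(payload))  # UDP length
    struct.pack_into("!H", pkt, 10, ip_checksum(bytes(pkt[:IP_HEADER.size])))
    return bytes(pkt) + payload


template = make_template("10.0.0.1", "10.0.0.2", 5000, 5001)
packet = build_packet(template, ident=7, payload=b"abcd")
```

Only four 16-bit fields are written per packet; addresses, ports, protocol, and TOS come from the copy, which is the saving the template is meant to provide.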

UDP_prime Packet Processing 

Since a UDP_prime packet is a UDP/IP packet, the 
standard UDP/IP receive processing can handle the 
packet and deliver it to the messaging client. To trig- 
ger the use of UDP_prime optimized receive process- 
ing, the sending system uses the type of service (TOS) 
field within the IP header to specify priority delivery of 
the packet. 21 The priority delivery indication does not 
by itself uniquely differentiate between UDP_prime 
and UDP/IP packets, as any other IP packets can 
also have the TOS field set to priority. As a result, the 
optimized receive processing has to check for the 
packet's adherence to the UDP_prime restrictions. 
Nonadherence to the restrictions reroutes the packet 
to the standard receive processing code. 

Digital Technical Journal Vol. 7 No. 3 1995 

When a packet arrives at a network interface, the 
interface posts a hardware interrupt, and the interface 
interrupt service routine processes the packet. The 
standard interrupt service routine deletes the data link 
header, and hands the packet over to the netisr kernel 
thread. 22 Netisr demultiplexes the packet based on 
the packet header contents and delivers it to the appli- 
cation's socket receive buffer. Netisr, designed to be 
a general-purpose packet demultiplexer, runs at a low- 
interrupt priority level. The main reason for a thread- 
based demultiplexer is extensibility. New protocol 
stacks can be registered to the thread. Since there is 
no a priori knowledge of the execution and SMP lock- 
ing requirements of these stacks, a thread-based low- 
interrupt priority demultiplexer is needed so that the 
network interrupt processing time can be held to a 
minimum. The extensibility feature, however, intro- 
duces a context switch overhead. 

For UDP_prime, the packet header processing time 
on the receive path is almost a small constant. We 
modified the interface service routine to demultiplex 
the packet by processing the data link, IP, and UDP 
headers, and deliver the packet to the socket receive 
buffer without handing it over to netisr. This short cir- 
cuit path is used only when the packet is a UDP/IP 
packet with no IP fragmentation and with priority 
delivery indication. If these conditions are not met, 
the standard netisr path is chosen. The UDP_prime 
receive path eliminates the netisr context switch over- 
head. This is a significant advantage, especially when 
the receiving application runs with a real-time FIFO 
scheduling policy. 
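
The choice between the two receive paths reduces to a small predicate over header fields. The sketch below is an illustration only: the function name is hypothetical, and PRIORITY_TOS is a placeholder, since the text does not state which TOS value marks priority delivery.

```python
IPPROTO_UDP = 17        # IP protocol number for UDP
PRIORITY_TOS = 0x10     # hypothetical marker value for priority delivery


def choose_receive_path(protocol, tos, fragment_offset, more_fragments):
    """Return "short_circuit" only for unfragmented, priority UDP/IP packets.

    Every other packet takes the standard path through the netisr thread.
    """
    is_udp = protocol == IPPROTO_UDP
    unfragmented = fragment_offset == 0 and not more_fragments
    priority = tos == PRIORITY_TOS
    if is_udp and unfragmented and priority:
        return "short_circuit"
    return "netisr"


fast = choose_receive_path(IPPROTO_UDP, PRIORITY_TOS, 0, False)
slow = choose_receive_path(IPPROTO_UDP, PRIORITY_TOS, 0, True)
```

Because the test reads only a few fixed header fields, it adds little to interrupt time, which is what makes delivery from the interrupt service routine acceptable.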

SMP Synchronization 

One difficulty in designing the UDP_prime stack to 
run in parallel with the standard UDP/IP stack was 
in SMP synchronization. 23 The socket buffer structure 
is a critical section guarded by a complex lock. 
Requesting a complex lock in Digital UNIX blocks 
execution if the lock is taken. To prevent deadlocks, 
its use is prohibited at an elevated priority level, such 
as the case in the interrupt service routine. To work 
around this problem, a new spin lock was introduced 
in the short circuit path and in the socket layer where 
access to the socket buffer needs to be synchronized. 

Performance 

To measure message-passing performance, we used 
two DEC 3000 Model 700 workstations connected by 
a GIGAswitch/FDDI system using TURBOchannel- 
based DEFTA full-duplex FDDI adapters. Each work- 
station contained a 225-megahertz (MHz) Alpha 
21064 microprocessor and was running the Digital 
UNIX version 3.0 operating system. 

Figure 6 shows the message-passing bandwidth for 
TCP/IP, UDP/IP, and UDP_prime transports at dif- 
ferent message sizes. The bandwidth was measured at 
the message-passing application programmer interface 
(API) level, taking into account allocation and deallo- 
cation of each message buffer in addition to the data 
transmission. TCP/IP, UDP/IP and UDP_prime 
bandwidth peaks at approximately 95 megabits per 
second at a 4,224-byte message, approaching the 
FDDI peak bandwidth. UDP/IP approaches the peak 
bandwidth at a 1,400-byte message, and UDP_prime 
at a 1,024-byte message. Reaching the peak band- 
width using small messages is a measure of protocol 
processing efficiency. 

Figure 7 shows the minimum message-passing 
latency for TCP/IP, UDP/IP, and UDP_prime 
transports at different message sizes. The latency was 
measured as half of the minimum time to send a mes- 
sage and receive the same message looped by the 
receiver system over many iterations. The measure- 
ment made allowance for the allocation and deallo- 
cation of each message buffer, in addition to the 
round-trip transmission. 

Compared to the TCP/IP option, UDP/IP has a 
slightly higher minimum latency. This is not expected, 
because the original goal of the UDP/IP option was to 
reduce TCP/IP processing overhead. It is, however, 
encouraging to see only a slight degradation in latency 
when the reliable in-order delivery protocol is imple- 
mented at the library level. This prompted us to use 
the same protocol engine in the library for 
UDP_prime. At a very small message size (4 bytes), 
protocol processing overhead dominates the latency. 
At this point, UDP_prime was 44 percent (103.5 
microseconds) better than TCP/IP, even though 
UDP/IP and UDP_prime use the same mechanism. 
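
The two figures quoted for the 4-byte case imply approximate absolute latencies, assuming the percentage is expressed relative to the TCP/IP latency. The values computed below are derived from that assumption and are not measurements reported in the text.

```python
improvement_fraction = 0.44   # "44 percent" better than TCP/IP
improvement_us = 103.5        # the same improvement in microseconds

# If 103.5 us is 44 percent of the TCP/IP latency, then:
tcp_latency_us = improvement_us / improvement_fraction    # about 235 us
udp_prime_latency_us = tcp_latency_us - improvement_us    # about 132 us
```

Under this reading, the minimum latency for a 4-byte message is roughly 235 microseconds with TCP/IP and roughly 132 microseconds with UDP_prime.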

As the message size increases, the protocol processing 
time remains constant, but the data copy time becomes 
dominant. Despite this, UDP_prime was approximately 
12 percent better at a 4-kilobyte message. 

Future Work 

The current communication options along with the 
UDP_prime optimization provide good performance 
for HPF-style message passing on SMP systems and 
clusters. To remain competitive, however, we need to 
consider support for new high-performance commu- 
nication media and configurations. We are working on 
support for MEMORY CHANNEL, the use of multi- 
ple interconnects and protocols within an application 
running on a cluster of SMPs, and lightweight proto- 
cols for use with ATM at speeds of 622 megabits per 
second and higher. The flexibility of the message pass- 
ing design will allow current applications to use future 
communication options without relinking. 

We are also working on a new HPF debugger
technology. Debugging a cluster-style HPF program is
considerably harder than debugging a uniprocessing
program. HPF's single-program multiple-data (SPMD)
parallel programming model includes a single-threaded
control structure, a global name space, and
loosely synchronous parallel execution. HPF also
supports the calling of extrinsic procedures that use other
parallel programming styles or nonparallel
computational kernels.



The goal of an HPF debugger is to present the
application in source-level terms. Since HPF is roughly
Fortran 90 with data-distribution directives, HPF is
conceptually a single-threaded application with the
compiler transforming pieces of the application to
execute in parallel. As a result, an HPF debugger has to
take the states from the actual peer processes and
recreate a single source-level view of the application. It
is not always possible to do this with complete
precision. Consider the user interrupting the application,
which interrupts the peer processes at different points
within the computation. It is unlikely that each peer is at
the same place (e.g., the same program statement),
and it is quite likely that the stack backtraces of the
peers differ. Even if they are at the same place, they
could be in different iterations of their local portions
of a parallelized loop-like operation.

At the start of the HPF debugger project, we
surveyed a variety of debuggers and disqualified all of
them for logistical and/or technical reasons. Rather 
than modify an existing debugger technology so that 
it could debug cluster-style HPF programs, we initi- 
ated an effort to build a new debugger technology. 
As we continue to design the new HPF debugger to 
be general-purpose, portable, and extensible, we will 
be able to capitalize on modern programming con- 
cepts, paradigms, and techniques. 

Summary 

PSE contains the tools and execution environment to 
debug, tune, and deploy parallel applications written 
in the HPF language. From an end user's perspective, 
PSE provides transparency, flexibility, and
compatibility with familiar tools. Using standard UNIX
command syntax, the same executable can be run serially
or in parallel on hardware ranging from a single-node
system to a cluster of SMP systems. PSE supports
several high-performance message-passing protocols
running over a variety of network media. From a system
administrator's perspective, PSE provides the flexibil- 
ity to create a cluster from standard components and 
to control the cluster by assigning access controls and 
setting scheduling policy and priorities. Although it 
currently supports only the HPF language, PSE has 
the flexibility and generic infrastructure to support 
other parallel languages and programming models.

Michael Stonebraker



An Overview of the 
Sequoia 2000 Project 



The Sequoia 2000 project is the joint effort 
of computer scientists, earth scientists, gov- 
ernment agencies, and industry partners to 
build a better computing environment for 
global change researchers. The objectives of 
this widely distributed project are to support 
high-performance I/O on terabyte data sets, 
to put all data in a database management 
system, and to provide improved visualization 
tools and high-speed networking. The partici- 
pants developed a four-level architecture to 
meet these objectives. Chief among the lessons 
learned is that the Sequoia 2000 system must 
be considered an end-to-end solution, with all 
pieces of the architecture working together. 
This paper describes the Sequoia 2000 project 
and its implementation efforts during the first 
three years. The research was sponsored by 
Digital Equipment Corporation. 



The purpose of the Sequoia 2000 project is to build a 
better computing environment for global change 
researchers, hereafter referred to as Sequoia 2000 
clients. These researchers investigate issues such as 
global warming, ozone depletion, environment toxifi- 
cation, and species extinction and are members of 
earth science departments at universities and national 
laboratories. A more detailed conception for the proj- 
ect appears in the Sequoia 2000 technical report 
"Large Capacity Object Servers to Support Global 
Change Research." 1 

The participants in the Sequoia 2000 project are 
investigators of four types: (1) computer science
researchers, (2) earth science researchers, (3) govern- 
ment agencies, and (4) industry partners. 

Computer science researchers are responsible for
building a prototype environment that better serves 
the needs of the target clients. Participating in 
the Sequoia 2000 project are investigators from the 
Computer Science Division at the University of 
California, Berkeley; the Computer Science
Department at the University of California, San Diego; the
School of Library and Information Studies at the 
University of California, Berkeley; and the San Diego 
Supercomputer Center. 

Earth science researchers must explain their needs 
to the computer science investigators and use the 
resulting prototype environment to perform better 
earth science research. The Sequoia 2000 project 
comprises earth science investigators from the 
Department of Geography at the University of
California, Santa Barbara; the Atmospheric Science 
Department at the University of California, Los 
Angeles (UCLA); the Climate Research Division at 
the Scripps Institution of Oceanography; and the 
Department of Earth, Air, and Water at the University 
of California, Davis. 

To ensure that the resulting computer environment 
addresses the needs of the Sequoia 2000 clients, gov- 
ernment agencies that are affected by global change 
matters participate in the project. The responsibility of 
these agencies is to steer Sequoia 2000 research 
toward achieving solutions to their problems. The 
government agencies that participate are the State of 
California Department of Water Resources (DWR),
the State of California Department of Forestry, the
Coordinated Environment Research Laboratory
(CERL) of the United States Army, the National
Aeronautics and Space Administration (NASA), the
National Oceanic and Atmospheric Administration
(NOAA), and the United States Geological Survey
(USGS).

The task of the industry participants is to use 
the Sequoia 2000 technology and to offer guidance 
and research direction. In addition, they are a source
of free or discounted computing equipment. Digital
Equipment Corporation was the original industry
partner. Recently, Epoch Systems, Hewlett-Packard,
Hughes, Illustra, MCI, Metrum Systems,
PictureTel, RSI, SAIC, Siemens, and TRW have
become participants. 

The purpose of this paper is to present the goals of 
the Sequoia 2000 project and to discuss how we 
achieved these goals and the results we accomplished
during the first three years. The paper describes the 
architecture that we decided to pursue and the state of 
the software efforts in the various areas. The most 
important lesson we have learned is that the Sequoia
2000 system must be considered an end-to-end
solution. Hence, clients can be satisfied only if all pieces of
the architecture work together in a harmonious
fashion. Also, many services required by the clients must be
provided by every element of the architecture, each
working with the others. We illustrate this end-to-end
characteristic of Sequoia 2000 by discussing three
issues that cross all parts of the system: guaranteed
delivery, abstracts, and compression. We then indicate
other specific lessons that we learned during the first 
three years of the project. The paper concludes with the 
current state of the project and its future directions. 

The Sequoia 2000 Architecture 

The Sequoia 2000 architecture is motivated by four
fundamental computer science objectives: 

1. Support high-performance I/O on terabyte data
sets. The Sequoia 2000 clients are frustrated by
current computing environments because they cannot
effectively store the massive amounts of data
desired for research purposes. The four academic
clients plus DWR collectively want to be able to
store approximately 100 terabytes of information,
much of which is common data sets used by
multiple investigators. These clients would like
high-performance system software that would allow
sharing of assorted tertiary memory devices. Unlike
the I/O activities of most other scientific
computing users, their activity involves primarily random
access. For example, DWR is digitizing the agency's
library of 500,000 slides and is putting it on-line
using the Sequoia 2000 system. This data set has
some locality of reference but will have
considerable random activity.

2. Put all data in a database management system
(DBMS). To maintain the metadata that describe
their data sets and thus aid in the retrieval of
information, the Sequoia 2000 clients want to move
all their data to a DBMS. More important, using
a DBMS will facilitate the sharing of information.
Because a DBMS insists on a common schema for
shared information, it will allow the researchers to
define a schema. Then all researchers must use
a common notation for shared data. Such a system
will be a big improvement over the current
situation where every data set exists in a unique format
and must be converted by every researcher who
wishes to use it.

3. Provide improved visualization tools. Sequoia
2000 clients use popular scientific visualization
tools such as Explorer, Khoros, AVS, and IDL and
are eager to use a next-generation toolkit.

4. Provide high-speed networking. Sequoia 2000
clients realize that a 100-terabyte storage server (or
100 1-terabyte servers) will not be located on each of
their desktops. Moreover, the storage is likely to be
located at the other end of a wide area network
(WAN), far from their client machines. Since the
clients' visualization scenarios invariably involve
animation, for example, showing the last 10 years
of the ozone hole by playing time forward, the
clients require ultrahigh-speed networking to move
sequences of images from a server machine to
a client machine.

To meet these objectives, we adopted the four-level 
architecture illustrated in Figure 1. The architecture 
comprises the footprint layer, the file system layer, the 
DBMS layer, and the application layer. This section
discusses our efforts at each of the levels and then con- 
cludes with a discussion of the Sequoia 2000 network- 
ing that connects the elements of the architecture. 

The Footprint Layer 

The footprint layer is a software system that shields
higher-level software, such as file systems, from
device-specific characteristics of robotic devices. These
characteristics include specific robot commands, block sizes,
and media-specific issues. The footprint layer can be
thought of as a common robot device driver. A
footprint implementation exists for each of the four tertiary
memory devices used by the project, namely, a Sony
write once, read many (WORM) optical disk jukebox,
an HP rewritable optical disk jukebox, a Metrum VHS
tape jukebox, and an Exabyte 8-millimeter tape
jukebox. Collectively, these four devices and the CPUs and
disk storage systems in front of them were named
Bigfoot, after the legendary, very tall recluse spotted
occasionally in the Pacific Northwest.

[Figure 1: The Sequoia 2000 Architecture]



The File System Layer 

On top of the footprint layer is the file system layer. 
Two file systems manage data in the Bigfoot multilevel 
memory hierarchy. The first file system is Highlight, 
which extends the Log-structured File System (LFS)
pioneered for disk devices by Ousterhout and 
Rosenblum to tertiary memory.' 5 The original LFS 
treats a disk device as a single continuous log onto 
which newly written disk blocks are appended. Blocks 
are never overwritten, so a disk device can always be 
written sequentially. Hence, the LFS turns a
random-write environment into a sequential-write
environment. In particular problem areas, this may lead to
much higher performance. Benchmark data support 
this conclusion/ In addition, the LFS can always
identify the last few blocks that were written prior to a file
system failure by finding the end of the log at recovery
time. File system repair is then very fast, because
potentially damaged blocks are easily found. This 
approach differs from conventional file system repair, 
where a laborious check of the disk must be performed 
to ascertain disk integrity. 
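
The log-structured idea described above can be illustrated with a small in-memory sketch. The class and method names (AppendOnlyLog, write, read, recover) are hypothetical and are not taken from the LFS or Highlight code; the sketch only shows that every write is an append and that recovery is a single sequential pass that ends at the most recently written block.

```python
class AppendOnlyLog:
    """Toy log-structured store: blocks are appended, never overwritten."""

    def __init__(self):
        self.log = []        # sequential log of (block_id, data) records
        self.index = {}      # block_id -> position of the newest record

    def write(self, block_id, data):
        # A logical overwrite is still a sequential append at the log tail.
        self.log.append((block_id, data))
        self.index[block_id] = len(self.log) - 1

    def read(self, block_id):
        return self.log[self.index[block_id]][1]

    def recover(self):
        # After a failure, rebuild the index with one sequential pass;
        # the most recently written blocks sit at the end of the log.
        self.index = {}
        for pos, (block_id, _) in enumerate(self.log):
            self.index[block_id] = pos
        return self.log[-1][0] if self.log else None


log = AppendOnlyLog()
log.write(7, b"old")
log.write(3, b"x")
log.write(7, b"new")   # a random overwrite of block 7 becomes an append
```

A real implementation would also reclaim the space held by superseded records such as the first copy of block 7; that step is omitted here.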

Highlight extends the LFS to support tertiary mem- 
ory by adding a second log-structured file system on 
top of the footprint layer. This file system also writes 
tertiary memory blocks sequentially, thereby obtain- 
ing the performance characteristics of the LFS. The 
Highlight file system adds migration and bookkeeping 
code that treats the disk LFS file system as a cache for
the tertiary memory file system. In summary, 
Highlight should provide good performance for 
workloads that consist of mainly write operations. 
Since Sequoia 2000 clients want to archive vast
amounts of data, the Highlight file system has the
potential for good performance in the Sequoia 2000 
environment. 

The second file system is Inversion.' Most DBMSs, 
including the one used for the Sequoia 2000 project, 
support binary large objects (BLOBs), which are
byte strings of arbitrary length. Like
several commercial systems, Sequoia's data manager 
POSTGRES stores large objects in a customized 
storage system directly on a raw storage device. 6 As 
a result, it is a straightforward exercise to support con- 
ventional files on top of DBMS large objects. In this 
way, the front end turns every read or write operation 
into a query or an update, which is processed directly 
by the DBMS. Simulating files on top of DBMS large 
objects has several advantages. First, DBMS services 
such as transaction management and security are auto- 
matically supported for files. In addition, novel charac- 
teristics of our next-generation DBMS, including time 
travel and an extensible type system for all DBMS 
objects, are automatically available for files. Of course, 
the possible disadvantage of simulating files on top of 
a DBMS is poor performance. As reported by Olson, however,
Inversion performance is exceedingly good when large 
blocks of data are read and written, as is characteristic 
of the Sequoia 2000 workload. 5 

At the present time, Highlight is operational but 
very buggy. Inversion, on the other hand, is used to 
manage production data on Sequoia's Sony WORM 
jukebox. Unfortunately, the reliability of the proto- 
type system has not met user expectations. Sequoia 
2000 clients have a strong desire for commercial off- 
the-shelf (COTS) software and are frustrated by docu- 
mentation glitches, bugs, and crashes. 

As a result, the Sequoia 2000 project team has also 
deployed two commercial file systems, Epoch and 
AMASS. The Epoch file system is quite reliable but 
does not support either of Sequoia's large-capacity 
robots. Hence, it is used heavily but only for small data
sets. The AMASS file system is just coming into pro- 
duction use for Sequoia's Metrum robot and replaces 
an earlier COTS system, which was unreliable. Given 
the experience of the Sequoia 2000 team with tertiary 
memory support, tertiary memory users should care- 
fully test all file system software. 

The DBMS Layer 

To meet Sequoia 2000 client needs, a DBMS 
must support spatial data such as points, lines, and 
polygons. In addition, the DBMS must support the 
large spatial arrays in which satellite imagery is natu- 
rally stored. These characteristics are not met by pop- 
ular, general-purpose relational and object-oriented 
DBMSs. 7 The best fit to client needs is a special- 
purpose Geographic Information System (GIS) or 
a next-generation object-relational DBMS. Since it 
has one such object-relational system, namely
POSTGRES, the Sequoia 2000 project elected to
focus its DBMS efforts on this system.

To make the POSTGRES DBMS suitable for 
Sequoia 2000 use, we require a schema for all Sequoia 
data. This database design process has evolved as a
cooperative exercise between various database experts
at Berkeley, the San Diego Supercomputer Center,
CERL, and SAIC. The Sequoia schema is the
collection of metadata that describes the data stored in the
POSTGRES DBMS on Bigfoot. Specifically, these 
metadata comprise 

■ A standard vocabulary of terms with agreed-upon
definitions that are used to describe the data

■ A set of types, instances of which may store data
values

■ A hierarchical collection of classes that describe 
aggregations of the basic types 

■ Functions defined on the types and classes 

The Sequoia 2000 schema accommodates four 
broad categories of data: scalar, vector, raster, and text. 
Scalar quantities are stored as POSTGRES types and 
assembled into classes in the usual way. Vector
quantities are stored in special line and polygon types.
Vectors are fully enumerated (as opposed to an arc-
node representation) to take advantage of POSTGRES 
indexed searches. The advantages of this representa- 
tion are discussed in more detail in " The Sequoia 
2000 Benchmark "~ 

Raster data constitute the bulk of the Sequoia 2000 
data. These data are stored in POSTGRES
multidimensional array objects. The contents of textual
objects (in PostScript or scanned page bitmaps) are 
stored in a POSTGRES document type. Both docu- 
ments and arrays make use of a POSTGRES large 
object storage manager that can support arbitrary- 
length objects. 

We have tuned the POSTGRES DBMS to meet
the needs of the Sequoia 2000 clients. The interface 
to POSTGRES arrays has been improved, and a novel 
chunking strategy is now operational." Instead of 
storing an array by ordering the array indexes from 
fastest changing to slowest changing, this system 
chooses a stride for each dimension and stores chunks
of the correct stride sizes in each storage object. When 
user queries inspect the array in more than one way, 
this technique results in dramatically superior retrieval 
performance. 
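
The effect of chunking can be sketched with a short calculation. The example below is an illustration under stated assumptions, not the POSTGRES implementation: the function names, the 1,024-by-1,024 array, and the two stride choices are hypothetical. It counts how many storage objects a query must touch when the array is stored one full row per object versus in square chunks of the same size.

```python
from math import prod

def chunk_of(index, strides):
    """Chunk coordinate that holds the array element at `index`."""
    return tuple(i // s for i, s in zip(index, strides))

def chunks_touched(lo, hi, strides):
    """Number of chunks intersecting the half-open region [lo, hi)."""
    return prod((h - 1) // s - l // s + 1 for l, h, s in zip(lo, hi, strides))

row_major = (1, 1024)   # one full row per storage object
square = (32, 32)       # the same 1,024 elements per storage object

one_row = ((0, 0), (1, 1024))      # query that reads a single row
one_col = ((0, 0), (1024, 1))      # query that reads a single column

for name, strides in (("row-major", row_major), ("chunked", square)):
    rows = chunks_touched(*one_row, strides)
    cols = chunks_touched(*one_col, strides)
    print(name, "row query:", rows, "column query:", cols)
```

With these assumed shapes, the row-major layout touches 1 object for the row query but 1,024 for the column query, whereas the square chunks touch 32 objects for either query.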

Sequoia 2000 clients typically run queries with
user-defined functions in the predicate. Moreover, many
of the predicates are very expensive in CPU time to
compute. For example, the Santa Barbara group has
written a function, SNOW, that recognizes the
snow-covered regions in a satellite image. It is a user-defined
POSTGRES function that accepts an image as an
argument and returns a collection of polygons. A typical
query using the SNOW function for the table
IMAGES (id, date, content) would be to find the
images that were more than 50 percent snow and that
were observed subsequent to June 1992. In SQL, this
query is expressed as follows:

select id 
from IMAGES
where AREA (SNOW (content)) > 0.5
and date > "June 1, 1992" 

The first clause in the predicate requires the CPU to 
evaluate millions of instructions, whereas the second 
clause requires only a few hundred instructions. The
DBMS must be cognizant of the CPU cost of clauses
when constructing a query plan, a cost component
that has been ignored by most previous optimization
work. We have extended the POSTGRES optimizer to 
deal intelligently with expensive functions.'' 
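
One way to see why clause cost matters is to compare the expected work of the two possible evaluation orders. The sketch below is an assumption-laden illustration: the text does not say which rule the POSTGRES optimizer applies, so the ordering heuristic used here, rank = (selectivity - 1) / cost, is a standard one from the literature on expensive predicates, and the instruction counts and selectivities are invented for the example.

```python
from itertools import permutations

def expected_cost(clauses):
    """Expected per-row cost of evaluating ANDed clauses in the given order.

    Each clause is (name, cost, selectivity); a row reaches a clause only
    if it passed every earlier one.
    """
    total, surviving = 0.0, 1.0
    for _, cost, selectivity in clauses:
        total += surviving * cost
        surviving *= selectivity
    return total

def order_by_rank(clauses):
    """Order clauses by rank = (selectivity - 1) / cost, most negative first."""
    return sorted(clauses, key=lambda c: (c[2] - 1.0) / c[1])

# Illustrative numbers only: instruction counts and selectivities are assumed.
clauses = [
    ("AREA(SNOW(content)) > 0.5", 5_000_000.0, 0.2),
    ('date > "June 1, 1992"', 300.0, 0.4),
]

plan = order_by_rank(clauses)
best = min(permutations(clauses), key=expected_cost)
```

With these assumed numbers, the date clause is evaluated first, and the expected per-row cost falls from about 5.0 million to about 2.0 million instructions because most rows never reach the SNOW call.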

It is highly desirable to allow popular expensive
functions to be precomputed. In this way, the CPU
need only evaluate each such function once, rather
than once for each query in which the function
appears. Our approach to this issue is to allow
databases to contain indexes on a function of the data and
not on just the data object itself. Hence, the database
administrator can specify that a B-tree index be built
for the function AREA (SNOW (content)). Areas of
images are arranged in sort order in a B-tree, so the
first clause in the above query is now very inexpensive
to compute. Using this technique, the function is
computed at data entry or data update time and not at
query evaluation time. A consequence of function
indexing is that inserting a new image into the
database may be very time-consuming, since function
computation is now included in the load transaction.
To deal with the undesirably lengthy response times
for some loads, we have also explored lazy indexing
and partial indexing. Thus, index building does not
need to be synchronous with data loading.
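
The trade-off of function indexing can be sketched as follows. This is a minimal illustration, not the POSTGRES mechanism: a sorted Python list stands in for the B-tree, and FunctionIndex and snow_area are hypothetical names. The point is that the expensive function runs once per insert and never at query time.

```python
import bisect

class FunctionIndex:
    """Sorted index on func(row), filled at insert time rather than query time."""

    def __init__(self, func):
        self.func = func
        self.keys = []    # sorted values of func(row)
        self.ids = []     # row ids, kept parallel to self.keys

    def insert(self, row_id, row):
        key = self.func(row)              # the expensive call happens once, here
        pos = bisect.bisect_right(self.keys, key)
        self.keys.insert(pos, key)
        self.ids.insert(pos, row_id)

    def greater_than(self, bound):
        """Row ids with func(row) > bound, found without calling func."""
        pos = bisect.bisect_right(self.keys, bound)
        return self.ids[pos:]


calls = []

def snow_area(image):
    # Stand-in for AREA(SNOW(content)): the fraction of pixels flagged as snow.
    calls.append(1)
    return sum(image) / len(image)

index = FunctionIndex(snow_area)
index.insert(1, [1, 1, 1, 0])   # 75 percent snow
index.insert(2, [0, 0, 1, 0])   # 25 percent snow
index.insert(3, [1, 1, 0, 0])   # 50 percent snow

hits = index.greater_than(0.5)
```

Lazy indexing, as mentioned above, would defer the call to func from insert to some later time; the sketch performs it synchronously for simplicity.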

The feedback from the Sequoia 2000 clients regard- 
ing POSTGRES is that it is not reliable enough to 
serve as a base for production work. Moreover, the
documentation is inadequate, and no facility exists to 
train users. Our users want a COTS product and not 
a research prototype. Consequently, the Sequoia 2000 
project has migrated to the commercial version of 
POSTGRES, namely the Illustra system, to obtain a 
COTS DBMS product. Migration to this system 
required reloading all project data, a task that is now 
nearly complete. 

The Application Layer 

The application layer of the Sequoia 2000 architecture 
contains five elements: 

1. An off-the-shelf visualization tool

2. A visualization environment

3. A browsing capability for textual information

4. A facility to interface the UCLA General
Circulation Model (GCM) to the POSTGRES/Illustra
system

5. A desktop videoconferencing or "picturephone"
facility

For the off-the-shelf visualization tool, we have 
converged around the use of AVS and IDL for project 
activities. AVS has an easy-to-use "boxes-and-arrows" 
user interface, whereas IDL has a more conventional 
linear programming notation. On the other hand, 
IDL has better two-dimensional (2-D) graphics fea- 
tures. Both AVS and IDL allow the user to read and 
write file data. To connect to the DBMS, we have writ- 
ten an AVS-POSTGRES bridge. This program allows 
the user to construct an ad hoc POSTGRES query and 
pipe the result into an AVS boxes-and-arrows network. 
Sequoia 2000 clients can use AVS for further process- 
ing on any data retrieved from the DBMS. IDL is 
being interfaced to AVS by the vendor. Consequently, 
data retrieved from the database can be moved into 
IDL using AVS as an intermediary. Now that we have 
migrated to the Illustra DBMS, we are considering
porting this AVS bridge to the Illustra application
programming interface (API).

AVS has some disadvantages as a visualization tool
for Sequoia 2000 clients. First, its type system, which
is different from the POSTGRES/Illustra type system,
has no direct knowledge of the common Sequoia
2000 schema. In addition, AVS consumes significant
amounts of main memory. Architecturally, AVS
depends on virtual memory to pass results between
various boxes. It also maintains the output of each box
in virtual memory for the duration of an execution
session. The user can thus change a run-time parameter
somewhere in the network, and AVS will recompute
only the downstream boxes by taking advantage of the
previous output. As a result, Sequoia 2000 clients,
who generally produce very large intermediate results,
consume large amounts of both virtual and real
memory. In fact, clients report that 64 megabytes of real
memory on a workstation is often not enough to
enable serious AVS use. Furthermore, AVS does not
support zooming in to investigate data of interest to
obtain higher resolution, nor does it keep track of the
history of how any given data element was
constructed, i.e., the so-called data lineage of an item.
Lastly, AVS has a video player model for animation
that is too primitive for many Sequoia 2000 clients.

Consequently, we have designed two new visualiza- 
tion environments. The first system, called Tecate, is 
being built at the San Diego Supercomputer Center. 
The Tecate infrastructure enables the creation of appli- 
cations that allow end users to browse for and visualize 
data from networked data sources. This software
platform capitalizes on the architectural strengths of
current scientific visualization systems, network
browsers, database management system front ends,
and virtual reality systems, as discussed in a companion
paper in this issue of the Journal. 1 " 

The other system, Tioga, is a boxes-and-arrows
programming environment that is DBMS-centric, i.e., the
environment type system is the same as the DBMS 
type system. The Tioga user interface gives the user 
a flight simulator paradigm for browsing the output 
of a network. In this way, the visualizer can navigate 
around data and then zoom in to obtain additional 
data on items of particular interest. The preliminary 
Tioga design was presented at the 1993 Very Large 
Databases Conference." A first prototype, described 
by Woodruff, is currently running. 12 

A commercial version of the Tioga environment has 
also been implemented by Illustra. The Sequoia 2000 
project is making considerable use of this tool, which is 
named Object-Knowledge. Early user experience with
both Tioga and Object-Knowledge indicates that these
systems are not easy to use. We are now exploring
ways to improve the Tioga system. The objective is to
build a system that a scientist with minimal training in
the environment can use without a reference manual. 

The third element of the application layer is a
browsing capability for textual information of interest 
to our clients. This capability is a cornerstone of the 
Sequoia 2000 architecture. Initially, we converted a 
stand-alone text retrieval system called Lassen to our 
DBMS-centric view. The first part of the Lassen system 
is a facility for constructing weighted keyword indexes 
for the words in a POSTGRES document. This index- 
ing system, Cheshire, builds on the pioneering work of 
the Cornell Smart system and operates as the action 
part of a POSTGRES rule, which is triggered on each 
document insertion, update, or removal. 1 15 The
second part of the Lassen system is a front-end query tool
that understands natural language. This tool allows 
a user to request all documents that satisfy a collection 
of keywords by using a natural language interface. The 
Lassen system has been operational for more than 
a year, and retrievals can be requested against the cur- 
rently loaded collection of Sequoia 2000 documents. 
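
A weighted keyword index of the kind described above can be sketched in a few lines. The sketch assumes a term-frequency times inverse-document-frequency weight, which is commonly associated with the Smart system; the text does not state the weighting that Cheshire actually uses, and the class and method names are hypothetical. The insert and remove methods play the role of the rule action that fires on document insertion and removal (an update would be a removal followed by an insertion).

```python
import math
from collections import Counter, defaultdict

class KeywordIndex:
    """Inverted index with tf-idf weights, maintained on insert and removal."""

    def __init__(self):
        self.postings = defaultdict(dict)   # term -> {doc_id: term frequency}
        self.doc_count = 0

    def insert(self, doc_id, text):
        for term, tf in Counter(text.lower().split()).items():
            self.postings[term][doc_id] = tf
        self.doc_count += 1

    def remove(self, doc_id):
        for docs in self.postings.values():
            docs.pop(doc_id, None)
        self.doc_count -= 1

    def weight(self, term, doc_id):
        docs = self.postings.get(term, {})
        if doc_id not in docs:
            return 0.0
        # Frequent in this document, rare in the collection -> high weight.
        return docs[doc_id] * math.log(self.doc_count / len(docs))

    def search(self, keywords):
        """Doc ids ranked by the summed weights of the given keywords."""
        scores = Counter()
        for term in keywords:
            for doc_id in self.postings.get(term, {}):
                scores[doc_id] += self.weight(term, doc_id)
        return [doc_id for doc_id, _ in scores.most_common()]


idx = KeywordIndex()
idx.insert("a", "ozone hole ozone")
idx.insert("b", "snow cover")
idx.insert("c", "ozone snow")
ranked = idx.search(["ozone"])
```

Here document "a" ranks ahead of "c" for the keyword "ozone" because the term occurs twice in "a" and once in "c".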

In addition, we have moved Lassen to Z39.50, 
a popular protocol oriented toward information inter- 
change and information retrieval." The client portion 
of Lassen has been changed to emit Z39.50, and 
we have written a Z39.50-to-POSTGRES translator
on the server side. In this way, the Lassen client code 
can access non-Sequoia 2000 information and the 
Sequoia 2000 server can be accessed by text-retrieval 
front ends other than the Cheshire system.

With our move to the Illustra DBMS, we have
converted the client side of Lassen to work with Illustra.
Illustra has an integrated document data type with
capabilities similar to the extensions we made to
POSTGRES.

A related Berkeley project is focused on digitizing 
all the Berkeley Computer Science Technical Reports. 
This project uses a Mosaic client to access a custom 
World Wide Web server called Dienst, which stores 
technical report objects in a UNIX file system. In a few
months, we expect to convert Dienst to store objects
in the Sequoia 2000 database, rather than in files.
When this system, nicknamed Database Dienst, is
operational, Mosaic/Dienst service will be available
for all textual objects in the Sequoia schema. 

Our fourth thrust in the application layer is a facility
to interface the UCLA General Circulation Model
(GCM) to the POSTGRES/Illustra system. This
program is a "data pump" because it pumps data out of
the simulation model and into the DBMS. We named
the program "the big lift" after the DWR pumping
station that raises Northern California water over the 
Tehachapi Mountains into Southern California. 

Basically, the UCLA GCM produces a vector of
simulation output variables for each time step of a lengthy
run for each tile in a three-dimensional (3-D) grid of
the atmosphere and ocean. Depending on the scale
of the model, its resolution, and the capability of the
serial or parallel machine on which the model is
running, the UCLA GCM can produce from 0.1 to 10.0
megabytes per second (MB/s) of output. The purpose of
the big lift is to install the output data into a database
in real time. UCLA scientists can then use
Object-Knowledge, Tioga, Tecate, AVS, or IDL to visualize
their simulation output. The big lift will likely have to
exploit parallelism in the data manager if it is required
to keep up with the execution of the model on a
massively parallel architecture.
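
To give a sense of scale, the quoted output rates can be converted to daily volumes. The helper below is only illustrative arithmetic on the figures above; it assumes a sustained rate over a full day and decimal units (1 GB = 1,000 MB), neither of which is stated in the text.

```python
SECONDS_PER_DAY = 24 * 60 * 60

def daily_output_gb(rate_mb_per_s):
    """Gigabytes produced per day at a sustained rate, using 1 GB = 1,000 MB."""
    return rate_mb_per_s * SECONDS_PER_DAY / 1000.0

low = daily_output_gb(0.1)     # slowest quoted rate
high = daily_output_gb(10.0)   # fastest quoted rate
```

Under these assumptions, the data pump would have to absorb between roughly 8.6 and 864 gigabytes per day of model output.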

The fifth application system is a conferencing
system. Since Sequoia 2000 is a distributed project, we
learned early that face-to-face meetings that required
participants to travel to other sites and electronic mail
were not sufficient to keep project members working
as a team. Consequently, we purchased conference
room videoconferencing equipment for each project
site. This technology costs approximately $50,000 per
site and allows multiway videoconferences over
integrated services digital network (ISDN) lines.

Although the conference room equipment has
helped project communication immensely, it must be
set up and taken down at each use because the rooms
it occupies at the various sites are normally used as
classrooms. Therefore, videoconferencing tends to be
used for arranged conferences and not for
spur-of-the-moment interactions. To alleviate this shortcoming,
Sequoia 2000 has also invested in desktop
videoconferencing. A video compression board, a microphone,
speakers, a network connection, a video camera, and
the appropriate software can turn a conventional
workstation into a desktop videoconferencing facility.
In addition, video can be easily transmitted over the
network interface that is present in virtually all Sequoia
2000 client machines. We are using the Mbone
software suite to connect about 30 of our client machines
in this fashion and are migrating most of our
videoconferencing activities to desktop technology. This
effort, which is called Hollywood, strives to further
improve the ability of Sequoia 2000 researchers to
communicate.

Note that the Sequoia 2000 researchers do not
need groupware, i.e., the ability to have common
windows on multiple client machines separated by a WAN,
in which common code can be run, updated, and
inspected. Rather, our researchers need a way to hold
impromptu discussions on project business. They
want a low-cost multicast picturephone capability, and
our desktop videoconferencing efforts are focused in
this direction.

Sequoia 2000 Networking 

The last topic of this section on the Sequoia 2000
architecture is the networking agenda. Regarding
Figure 1, it is possible for the implementation of each
layer to exist on a different machine. Specifically, the
application can be remote from the DBMS, which can
be remote from the file system, which can be remote
from the storage device. Each layer of the Sequoia
2000 architecture assumes a local UNIX socket
connection or a local area network (LAN) or WAN
connection using the transmission control protocol/internet
protocol (TCP/IP). Actual connections among
Sequoia 2000 sites use either the Internet or a
dedicated T3 network, which the University of California
provides as part of its contribution to the project.

The networking team judged Digital's Alpha
processors to be fast enough to route T3 packets.
Hence, the project uses conventional workstations as
routers; custom machines are not required.
Furthermore, the Sequoia 2000 network has installed
a unique guaranteed delivery service through which
an application can make a contract with the network
that will guarantee a specific bandwidth and latency if
the client sends information at a rate that does not
exceed the rate specified in the contract. These
algorithms, which are based on the work of Ferrari, require
a setup phase for a connection that allocates
bandwidth on all the lines and in all the switches. 15
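The setup phase described above can be pictured with a small sketch. This is only an illustration of all-or-nothing bandwidth reservation along a path; it is not Ferrari's algorithm, and the `Link` class, the function names, and the link capacities are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Link:
    """One line or switch on the path, with a fixed capacity in MB/s."""
    name: str
    capacity: float
    reserved: float = 0.0

    def available(self) -> float:
        return self.capacity - self.reserved


def setup_connection(path, rate):
    """Reserve `rate` MB/s on every link of `path`, or reserve nothing.

    Returns True if the contract is accepted and False if any link
    lacks enough spare capacity.
    """
    if any(link.available() < rate for link in path):
        return False
    for link in path:
        link.reserved += rate
    return True


def within_contract(bytes_sent, seconds, rate):
    """The guarantee applies only while the sender stays at or below `rate`."""
    return bytes_sent <= rate * 1_000_000 * seconds


# Hypothetical three-hop path; the capacities are illustrative only.
path = [Link("campus", 10.0), Link("wan", 5.5), Link("server", 10.0)]
first = setup_connection(path, 5.0)    # fits on every link
second = setup_connection(path, 1.0)   # "wan" has only 0.5 MB/s left
```

The second request is refused without touching any link, which is the point of a setup phase: a contract is either honored on the whole path or not granted at all.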

Lastly, the network researchers are concerned that
the Digital UNIX (formerly DEC OSF/1) operating
system copies every byte four times between retrieving
it from the disk and sending it out over a network
connection. The efficient integration of networking
services into the operating system is the topic of
a companion paper by Pasquale et al. in this issue.

Sequoia 2000 as an End-to-End Problem

The major lesson we have learned from the Sequoia
2000 project is that many issues facing our clients
cannot be isolated to a single layer of the Sequoia 2000
architecture. This section describes three such
end-to-end problems: guaranteed delivery, abstracts, and
compression.

Guaranteed Delivery 

Clearly, guaranteed delivery must be an end-to-end
contract. Suppose a Sequoia 2000 client wishes to
visualize a specific computation; for example, the client
wants to observe Hurricane Andrew as it moves from
the Bahamas to Florida to Louisiana. Specifically, the
client wishes to visualize appropriate satellite imagery at
a resolution of 500 × 500 in 8-bit color at 10 frames
per second. Hence, the client requires 2.5 MB/s of
bandwidth to his screen. The following computation
steps might take place.

The DBMS must run a query to fetch the satellite
imagery. The query might require returning a 16-bit
data value for each pixel that will ultimately appear on
the screen. The DBMS must therefore agree to
execute the query in such a way that it guarantees output
at a rate of 5.0 MB/s.

The storage system at the server will fetch some
number of I/O blocks from secondary and/or tertiary
memory. DBMS query optimizers can accurately guess
how many blocks they need to read to satisfy the
query. The DBMS can then easily generate a
guaranteed delivery contract that the storage manager must
satisfy, thus allowing the DBMS to satisfy its contract.

The network must agree to deliver 5.0 MB/s over
the network link that connects the client to the server. 
The Sequoia 2000 network software expects exactly 
this type of contract request. 

The visualization package must agree to translate 
the 16-bit data element into an 8-bit color and render 
the result onto the screen at 2.5 MB/s. 

In short, guaranteed delivery is a collection of
contracts that must be adhered to by the DBMS, the
storage system, the network, and the visualization
package. One approach to architecting these contracts
was presented at the 1993 Very Large Databases
Conference.
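The rates quoted in this scenario follow directly from the image parameters. The sketch below reproduces the arithmetic; it assumes that "MB" means 10**6 bytes and that frames are sent uncompressed, and the function and variable names are illustrative only.

```python
MB = 1_000_000  # assumption: the text's "MB" is taken as 10**6 bytes


def stream_rate(width, height, bits_per_pixel, frames_per_second):
    """Return the sustained rate in MB/s of an uncompressed image stream."""
    bytes_per_frame = width * height * bits_per_pixel // 8
    return bytes_per_frame * frames_per_second / MB


# 8-bit color on the client's screen.
screen_rate = stream_rate(500, 500, 8, 10)    # 2.5 MB/s
# 16-bit data values returned by the query.
query_rate = stream_rate(500, 500, 16, 10)    # 5.0 MB/s

# Rate each layer must agree to. The storage system is omitted because
# the text gives no figure for its contract.
contracts = {
    "DBMS": query_rate,
    "network": query_rate,
    "visualization package": screen_rate,
}
```

The factor of two between the DBMS and the screen comes entirely from the 16-bit to 8-bit translation performed by the visualization package.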

Abstracts 

One aspect of the Sequoia 2000 visualization process
is the necessity of abstracts. Consider the Hurricane
Andrew example. The client might initially want to
browse the hurricane at 100 × 100 resolution. Then,
on finding something of interest, the client would
probably like to zoom in and increase the resolution,
usually to the maximum available in the original data.
This ability to dynamically change the amount of
resolution in an image is supported by abstracts.
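As a concrete picture of a resolution abstract, the sketch below derives a 100 × 100 browse image from a 500 × 500 original by block averaging. Block averaging is an assumption chosen for simplicity; the text does not say how Sequoia 2000 abstracts are computed, and `make_abstract` is a hypothetical name.

```python
def make_abstract(image, factor):
    """Reduce a 2-D list of pixel values by averaging factor x factor blocks."""
    rows = len(image) // factor
    cols = len(image[0]) // factor
    abstract = []
    for r in range(rows):
        row = []
        for c in range(cols):
            block = [
                image[r * factor + i][c * factor + j]
                for i in range(factor)
                for j in range(factor)
            ]
            row.append(sum(block) // len(block))
        abstract.append(row)
    return abstract


# Synthetic 500 x 500 image standing in for one satellite frame.
full = [[(x + y) % 256 for x in range(500)] for y in range(500)]
browse = make_abstract(full, 5)   # 100 x 100 abstract for browsing
```

Zooming in then amounts to fetching the original pixels for the region of interest instead of the abstract.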



Note that providing abstracts is a much more
powerful construct than merely providing for resolution
adjustment. Specifically, obtaining more detail may
entail moving from one representation to another. For
example, one could have an icon for a document,
zoom in to see the abstract, and then zoom in further
to see the entire document. Hence, zooming can
change from iconic to textual representation. This use
of abstracts was popularized in the DBMS community
by an early DBMS visualization system called the
Spatial Data Management System (SDMS). 17

Sequoia 2000 clients wish to have abstracts;
however, it is clear that they can be managed by the
visualization tool, the DBMS, the network, or the file
system. In the first case, abstracts are defined for
boxes-and-arrows networks. In the DBMS, abstracts
would be defined for individual data elements or for
data classes. If the network manages abstracts, it will
use them to automatically lower resolution to
eliminate congestion. Much research on the optimization
of network abstracts (called hierarchical encoding of
data in that community) is available. In the file system,
abstracts would be defined for files. Sequoia 2000
researchers are pursuing all four possibilities, and it is
expected that this notion will be one of the powerful
constructs to be used by Sequoia 2000 software,
perhaps in multiple ways.

Compression 

The Sequoia 2000 clients are adamant on the issue of
compression: they are open to any compression
scheme as long as it is lossless. As scientists, they
believe that ultimate resolution may be required to
understand future phenomena. Since it is not possible
to predict what these phenomena might be or where
they might occur, the Sequoia 2000 scientists want
access to all data at full resolution.

Some Sequoia 2000 data cannot be compressed 
economically and should be stored in uncompressed 
form. The inclusion of abstracts offers a mechanism to 
lower the bandwidth required between the storage
device and the visualization program. No saving of
tertiary memory through compression is available for 
such data. 

Other data ought to be stored in compressed form.
The question of when compression and decompression
should occur can be handled by using a
just-in-time decompression strategy. For example, if the
storage manager compresses data as they are written
and then decompresses them on a read operation, the
network manager may then recompress the data for
transmission over a WAN to a remote site where they
will be decompressed a second time. Obviously, data
should be moved in compressed form and
decompressed only when necessary. In general, decompression
will occur in the visualization system on the client
machine. If search criteria are applied to the data,
then the DBMS may have to decompress the data to
perform the search. If an application resides on the
same machine as the storage manager, the file system
must be in charge of decompressing the data. All
software modules in the Sequoia 2000 architecture must
cooperate to perform just-in-time decompression and
as-early-as-possible compression. Like guaranteed
delivery, compression is a task that requires all software
modules to cooperate.
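The cooperation described above can be sketched with a lossless codec. The example uses Python's standard `zlib` module purely as a stand-in; the text does not name the compression scheme used in Sequoia 2000, and the four function names are hypothetical labels for the layers involved.

```python
import zlib


def store(raw: bytes) -> bytes:
    """Storage manager: compress as early as possible, on write."""
    return zlib.compress(raw)


def transmit(stored: bytes) -> bytes:
    """Network: move the compressed bytes unchanged, with no recompression."""
    return stored


def visualize(received: bytes) -> bytes:
    """Client visualization system: decompress just in time for rendering."""
    return zlib.decompress(received)


def search(stored: bytes, needle: bytes) -> bool:
    """DBMS: decompress only because the search needs the raw data."""
    return needle in zlib.decompress(stored)


raw = b"sea surface temperature " * 1000
stored = store(raw)
rendered = visualize(transmit(stored))
```

Because the codec is lossless, `rendered` is byte-for-byte identical to `raw`, which is the property the Sequoia 2000 clients insist on.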

Specific Lessons Learned 

In addition to the end-to-end issues, we learned other
lessons from the first three years of the Sequoia 2000
experience, as discussed in this section.

Lesson 1: Infrastructure is necessary, time-consuming, 
and very expensive. 

We learned early in the project that electronic mail and
travel between sites would not result in the desired
degree of cooperation from geographically dispersed
researchers from different disciplines. Consequently,
we made a significant investment in infrastructure.
This included meetings for all the Sequoia 2000
participants, which are now held twice a year, and
videoconferencing equipment at each site. Through this
video link, project members interact by holding
a weekly distributed seminar, semimonthly operations
committee meetings, occasional steering committee
meetings, and meetings between researchers with
common interests. The video quality of the project's
current videoconferencing equipment is not high, and
to achieve success when participants are located far
apart, specially trained individuals must operate the
equipment. Nevertheless, the equipment has proven
to be valuable in generating cohesion in the dispersed
project. We have installed desktop videoconferencing
systems on 30 Sequoia 2000 workstations and expect
to replace our current conference room equipment
with next-generation desktop technology.

In addition, we conducted a learning experiment in
which a course taught by one of the Sequoia 2000
faculty members at the Santa Barbara campus was
broadcast over our videoconferencing equipment to four
other sites. Students could take the course for credit at
their respective campuses. Of course, the overhead of
setting up such a course was substantial. A new course
had to be added at each campus, and every step in the
approval process required briefings on the fact that the
instructor was from a different campus and on the way
everything was going to work. This experiment was
popular, and students have requested additional
courses taught in this manner.

On the other hand, we also tried to run a computer
science colloquium using this technology. We
broadcast from various sites to six computer science
departments around the U.S. Initial student interest was high
because of the lineup of eminent speakers. Such
speakers could be recruited easily, because they only had to
locate the nearest compatible equipment and then get
to that site. No air travel was required. The experiment
failed, however, because attendance decreased
throughout the semester and ended at an extremely low level.

The basic problem was that, typically, speakers were
not skilled in using the medium: they would put too
much information on slides and then flip through the
slides before remote sites could get a complete
transmission. Also, the question-and-answer period could
not be very interactive because of the many sites
involved. The experiment ended after one semester
and will not be repeated.

Lesson 2: There was often a mismatch between the 
expectations of the earth scientists and those of the 
computer scientists. 

The computer scientists on the Sequoia 2000 team
wanted access to knowledgeable application specialists
who could describe their problems in terms
understandable to the computer scientist. The computer
scientists then wanted to think through elegant
solutions, verify with the earth scientists that the solutions
were appropriate, and then prototype the results. The
earth scientists wanted final COTS solutions to their
problems; they were unsympathetic about poor
documentation, bugs, and crashes.

With considerable effort, the expectations are
converging. The ultimate solution is to move to COTS
software modules as they become available for
portions of the system and augment the modules with
in-house prototype code.

We have found that the best way to make forward
progress was to ensure that each earth science group
using Sequoia 2000 prototype code had one or more
sophisticated staff programmers who could deal
successfully with the quirks of prototype code. With
computer science expertise surrounding the earth
scientists, the problems in this area became much more
manageable. We also discovered that we could
distribute such expertise. In fact, support programmers for
Sequoia 2000 clients are often not at the same physical
location as the client.

Lesson 3: Interdisciplinary research is fundamentally 
difficult. 

One lengthy discussion on the construction of a
Sequoia 2000 benchmark eventually led to the
discussion presented in the 1993 ACM SIGMOD conference
paper entitled "The Sequoia 2000 Benchmark,"
which we referred to previously. The computer
science researchers were arguing strongly for a
representative abstract example of earth science data access,
i.e., the "specmark of earth science." On the other
hand, the earth scientists were equally adamant that
the benchmark convey the exact data accesses.
Finally, the computer scientists and the earth
scientists realized that the word "benchmark" has a different
meaning for each of the two groups of researchers. To
earth scientists, a benchmark is a scenario, whereas to
computer scientists, a benchmark is an abstract
example. This vignette was typical of the experience these
two disciplines had trying to understand one another.
Fundamentally, this process is time-consuming, and
ample interaction time should be planned for any
project that must deal with multiple disciplines.

The Sequoia 2000 project participants made effec- 
tive use of "converters." A converter is a person of one 
discipline who is planted directly in the research group
of another discipline. Through informal communica- 
tion, this person serves as an interpreter and translator 
for the other discipline. Converters are encouraged by 
the existence of a formal exchange program, whereby 
central Sequoia 2000 resources pay the living expenses 
of the exchange personnel. 

Lesson 4: Database technology is a major advance for 
earth scientists. 

Our initial plan was to introduce database technology
into the project with the expectation that the earth
scientists would pick it up and use it. Unfortunately, they
were accustomed to data being in files and found it very
difficult to make the transition to a database view. The
earth scientists are becoming increasingly aware of
the inherent advantages of DBMS technology.

In addition, we appointed the earth scientist with 
the most computer science knowledge as leader of the 
database design effort. This person chaired a commit- 
tee of mainly computer scientists who were charged 
with producing a schema. 

This technique failed for several reasons. First, the
computer scientists disagreed about whether we were
designing an interchange format, by which sites could
reliably exchange data sets (i.e., an on-the-wire
representation), or a schema for stored data at a site. Most
earth science standards, such as the Hierarchical Data
Format (HDF) and the network Common Data Form
(netCDF), are of the first form, and there was
substantial enthusiasm for simply choosing one of these
formats. On the other hand, some computer scientists
argued that an on-the-wire representation mixes the
data (e.g., a satellite image) and the metadata that
describe it (e.g., the frequency of the sensor, the date
of the data collection, and the name of the satellite)
into a single, highly encoded bit string. A better design
would separate the two kinds of data and construct
a good stored schema for it.

A second problem was that numerous legacy
formats are currently in use, and some earth scientists
did not want to change the formats they were using.
This led to many arguments about the merits of one
legacy format over another, which in turn caused the
opposing sides to conclude that both formats under
discussion should be supported in addition to
a neutral representation.

A third problem was that earth science data are
fundamentally quite complex. For example, earth
scientists store geographic points, which are 3-D positions
on the earth's surface. There are approximately 20
popular projections of 3-D space onto 2-D space,
including (latitude, longitude), Mercator projection,
and Lambert Equal Azimuthal projection. With every
instance of a geographic point, it is necessary to
associate the projection system that is being used. Another
data problem is related to units. Some geographic data
are represented as integers, with miles as the
fundamental unit; other data are represented as
floating-point numbers, with meters as the underlying unit.
In addition, satellite imagery must be massaged in
a variety of ways to "cook" it from raw data into
a usable form. Cooking includes converting imagery
from a one-dimensional stream of data recorded in
satellite flight order into a 2-D representation. Many
details of this cooking process must be recorded for all
imagery. This dramatically expands the metadata
about imagery as well as forces the earth scientist to
write down all the extra data elements.
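One way to picture the bookkeeping this implies is a point type that always carries its projection and unit. The sketch below is only an illustration; the class, the field names, and the helper functions are hypothetical and are not taken from the Sequoia 2000 schema.

```python
from dataclasses import dataclass

METERS_PER_MILE = 1609.344


@dataclass(frozen=True)
class GeoPoint:
    """A projected 2-D position that records how it should be interpreted."""
    x: float
    y: float
    projection: str   # e.g. "latitude-longitude" or "Mercator"
    unit: str         # unit of x and y, e.g. "deg", "mi", or "m"


def comparable(a: GeoPoint, b: GeoPoint) -> bool:
    """Two points can be compared directly only in the same frame."""
    return a.projection == b.projection and a.unit == b.unit


def length_in_meters(value: float, unit: str) -> float:
    """Normalize a length stored in miles or meters to meters."""
    if unit == "m":
        return float(value)
    if unit == "mi":
        return value * METERS_PER_MILE
    raise ValueError(f"unsupported length unit: {unit}")


p = GeoPoint(120.0, 35.0, "Mercator", "mi")
q = GeoPoint(193121.28, 56327.04, "Mercator", "m")
```

Here `p` and `q` use the same projection but different units, so `comparable(p, q)` is false until one of them is converted.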

Schema design turned out to be laborious and very
difficult. The earth scientists did not have a good
understanding of database design and thus were not
prepared to take on the extreme complexity of the
task. As a result, we have reconstructed our database
design effort. Now, two computer scientists are
responsible for producing a schema. They interact
with the earth scientists when such action helps to
accomplish the task.

Lesson 5: Project management is a substantial problem. 

Sequoia 2000 is a large project. About 110 people
attended the last general meeting. The attendees
included approximately 30 computer scientists, 40
earth scientists, and 40 visitors from industry. Multiple
efforts on multiple campuses must "plug and play."
Synchronizing distributed development is an extreme
challenge. Furthermore, the skill of project
management is not fostered in a university environment, nor
is it rewarded in a university faculty evaluation.

The principal investigators viewed the time spent
on project management as time that could be better
invested in research activities. An obvious solution
would be for the Sequoia 2000 project to hire
a professional project manager. Unfortunately, it is
impossible to pay a nonfaculty person the market rates
normally received by such skilled persons. One
strategy we attempted to use was to solicit a visitor with
the desired skill mix from one of our industrial
sponsors. Our efforts in this direction failed, and we were
never able to recruit project management expertise for
the Sequoia 2000 effort. As a result, project
management was performed poorly at best. In any future large
project, this component should be addressed
satisfactorily up front by project personnel.

Lesson 6: Multicampus projects are extremely difficult 
to implement. 

Sequoia 2000 work is taking place in seven different 
organizations within the University of California edu- 
cational system. There is a constant need to transfer 
money and people among these organizations. Accom- 
plishing such moves is a difficult and slow process, 
however, because of the bureaucracy within the sys- 
tem. In addition, the personnel rules of the University 
are often in conflict with the needs of the Sequoia 
2000 project. As a result, multi-institution projects, 
where participants are in different and often distant 
locations, are extremely difficult to carry out.

Status and Future Plans 

The Sequoia 2000 project is more than three years old
and has nearly accomplished its objectives. We have
a common schema in place for all Santa Barbara and
UCLA data, and all participants have agreed to use the
schema. This schema serves as leverage for the
standards efforts under way in the spatial arena. The
infrastructure is in place to enable this schema to
evolve as more data types, user-defined functions, and
operators are included in the future.

The combination of Object-Knowledge, Illustra,
Epoch, and AMASS is proving robust and meets our
clients' needs. Lastly, we have ample resources to
move our prototype into production use at UCLA and
Santa Barbara during the next several months.

We are also extending the scope of the prototype in
two different directions. First, we will recruit
additional earth scientists to utilize our system. This will
require extending our common schema to meet their
needs and then installing our suite of software at their
site. We expect to recruit two to three new groups
during the next year.

Second, a companion project, the Electronic
Repository, has as one of its objectives to use the
Sequoia 2000 technology to support an
environmental digital library of aerial photography, polygonal
data, and text for the Resources Agency of the State of
California. 21 This electronic library project is
extending the reach of Sequoia 2000 technology from earth
scientists toward a broader community.

Our research activities are also very active. As noted
earlier, we are continuing our visualization activities
and anticipate an improved Tioga system. The
Sequoia 2000 clients have made it clear that they want
seamless access to distributed data, and we have
evolved POSTGRES to a wide-area distributed DBMS
that makes decisions based on an economic paradigm.
This system is called Mariposa. In our COTS system,
a bad impedance mismatch exists between the DBMS
and the tertiary memory file systems. We have
therefore shifted our research focus to constructing an
intelligent mass storage interface that properly
supports a DBMS.

Finally, the Sequoia 2000 network currently
supports service guarantees, but there is no economic
framework in which to place multiple levels of service.
As a result, our networking research is focused on
construction of this type of framework.

We anticipate a robust production environment for
earth science researchers by the end of 1995. In
addition, we expect to continue to improve the Sequoia
2000 environment with future research results in the
above areas.

A major effort in the Sequoia 2000 project was to
build a very large database of earth science infor- 
mation. Without providing the means for scien- 
tists to efficiently and effectively locate required 
information and to browse its contents, how- 
ever, this vast database would rapidly become 
unmanageable and eventually unusable. The 
Sequoia 2000 Electronic Repository addresses 
these problems through indexing and retrieval 
software that is incorporated into the POSTGRES 
database management system. The Electronic 
Repository effort involved the design of proba- 
bilistic indexing and retrieval for text documents 
in POSTGRES, and the development of algorithms
for automatic georeferencing of text
documents and segmentation of full texts 
into topically coherent segments for improved 
retrieval. Various graphical interfaces support 
these retrieval features. 



Global change researchers, who study phenomena that 
include the Greenhouse Effect, ozone depletion, 
global climate modeling, and ocean dynamics, have 
found serious problems in attempting to use current 
information systems to manage and manipulate the 
diverse information sources crucial to their research.
These information sources include remote sensing data 
and images from satellites and aircraft, databases of 
measurements (e.g., temperature, wind speed, salinity, 
and snow depth) from specific geographic locations, 
complex vector information such as topographic maps, 
and large amounts of text from a variety of sources.
These textual documents range from environmental 
impact reports on various regions to journal articles 
and technical reports documenting research results. 

The Sequoia 2000 project brought together
computer and information scientists from the University
of California (UC), Digital Equipment Corporation,
and the San Diego Supercomputer Center (SDSC),
and global change researchers from UC campuses to
develop practical solutions to some of these problems.
One goal of this collaboration was the development of
a large-scale (i.e., multiterabyte) storage system that
would be available to the researchers over high-speed
network links. In addition to storing massive amounts
of data in this system, global change researchers
needed to be able to share its contents, to search for
specific known items in it, and to retrieve relevant
unknown items based on various criteria. This sharing,
searching, and retrieving had to be done efficiently
and effectively, even when the scale of the database
reached the multiterabyte range.

The goal of the Electronic Repository portion of 
the Sequoia 2000 project was to design and evaluate
methods to meet these needs for sharing, searching, 
and retrieving database objects (primarily text docu- 
ments). The Sequoia 2000 Electronic Repository 
is the precursor of several ongoing projects at 
the University of California, Berkeley, that address 
the development of digital libraries. 

For repository objects to be effectively shared and
retrieved, they must be indexed by content. User
interfaces must allow researchers to both search for items
based on specific characteristics and browse the
repository for desired information. This paper summarizes
the research conducted in these areas by the Sequoia
2000 project participants. In particular, the paper
describes the Lassen text indexing and retrieval
methods developed for the POSTGRES database system,
the GIPSY system for automatic indexing of texts
using geographic coordinates based on locations
mentioned in the text, and the TextTiling method for
improving access to full-text documents.

Indexing and Retrieval in the Electronic Repository 

The primary engine for information storage and
retrieval in the Sequoia 2000 Electronic Repository
is the POSTGRES next-generation database
management system (DBMS). POSTGRES is the core of
the DBMS-centric Sequoia 2000 system design. All
the data used in the project was stored in POSTGRES,
including complex multidimensional arrays of data,
spatial objects such as raster and vector maps, satellite
images, and sets of measurements, as well as all the
full-text documents available. The POSTGRES DBMS
supports user-defined abstract data types, user-defined
functions, a rules system, and many features of
object-oriented DBMSs, including inheritance and methods,
through functions in both the query language, called
POSTQUEL, and conventional programming
languages. The POSTQUEL query language provides all
the features found in relational query languages like
SQL and also supports the nonrelational features of
POSTGRES. These features give POSTGRES the
ability to support advanced information retrieval methods.

We used these features of POSTGRES to develop
prototype versions of advanced indexing and retrieval
techniques for the Electronic Repository. We chose
this approach rather than adopting a separate retrieval
system for full-text indexing and retrieval for the
following reasons:

1. Text elements are pervasive in the database, ranging
in size from short descriptions or comments on
other data items to the complete text of large
documents, such as environmental impact reports.

2. Text elements are often associated with other data
items (e.g., maps, remote sensing measurements,
and aerial photographs), and the system must
support complex queries involving multiple data types
and functions on data.

3. Many text-only systems lack support for concurrent
access, crash recovery, data integrity, and security of
the database, which are features of the DBMS.

4. Unlike many text retrieval systems, DBMSs permit
ad hoc querying of any element of the database,
whether or not a predefined index exists for that
element.

Moreover, there are a number of interesting
research issues involved in the integration of methods
of text retrieval derived from information retrieval
research with the access methods and facilities of
a DBMS. Information retrieval has dealt primarily
with imprecise queries and results that require human
interpretation to determine success or failure based on
some specified notion of relevance. Database systems
have dealt with precise queries and exact matching of
the query specification. Proposals exist to add
probabilistic weights to tuples in relations and to extend
the relational model and query language to deal with
the characteristics of text databases. Our approach to
designing this prototype was to use the features of the
POSTGRES DBMS to add information retrieval
methods to the existing functionality of the DBMS. This
section describes the processes used in the prototype
version of the Lassen indexing and retrieval system and
also discusses some of the ongoing development work
directed toward generalizing the inclusion of advanced
information retrieval methods in the DBMS. 6

Indexing 

The Lassen indexing method operates as a daemon 
invoked whenever a new text item is appended to the 
database. Several POSTGRES database relations (i.e., 
classes, in POSTGRES terminology) provide support 
for the indexing and retrieval processes. Figure 1 
shows these classes and their logical linkages. These 
classes are intended to be treated as system-level 
classes, which are usually not seen by users.

The wn_index class contains the complete WordNet
dictionary and thesaurus. 7 It provides the normalizing
basis for terms used in indexing text elements of the
database. That is, all terms extracted from data elements
in the database are converted to the word form used in
this class. The POSTQUEL statement defining the
class is

create wn_index (
    termid     = int4,    /* unique term ID */
    word       = text,    /* the term or phrase */
    pos        = char,    /* WordNet part of speech
                             information */
    sense_cnt  = int2,    /* number of senses of word */
    ptruse_cnt = int2,    /* types and locations of */
    offset_cnt = int2,    /* related terms in WordNet */
    ptruse     = int2[],  /* database are stored in */
    offset     = int4[])  /* these arrays */

All other references to terms in the indexing process
are actually references to the unique term identifiers
(termid) assigned to words in this class. The wn_index
dictionary contains individual words and common
phrases, although in the prototype implementation,
only single words are used for indexing purposes. The
other parts of the record include WordNet database
information such as the part of speech (pos) and an
array of pointers to the different senses of the word.
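The normalization step can be pictured as a lookup from surface words to term identifiers. The sketch below is a toy illustration only: the three-entry dictionary stands in for the wn_index class, the plural-stripping rule is an assumption rather than WordNet's actual morphology, and the function names are hypothetical.

```python
# Toy stand-in for wn_index: word form -> termid (values are made up).
WN_INDEX = {"storm": 1, "flood": 2, "river": 3}


def normalize(token):
    """Return the dictionary word form for a token, or None if unknown."""
    word = token.lower().strip(".,;:!?()\"'")
    if word in WN_INDEX:
        return word
    # Crude plural stripping; a real system would use WordNet's own rules.
    for suffix in ("es", "s"):
        if word.endswith(suffix) and word[: -len(suffix)] in WN_INDEX:
            return word[: -len(suffix)]
    return None


def term_ids(text):
    """Map a text to the termids of its single-word dictionary terms."""
    ids = []
    for token in text.split():
        word = normalize(token)
        if word is not None:
            ids.append(WN_INDEX[word])
    return ids


ids = term_ids("Storms and floods along the River.")
```

Words that are not in the dictionary, such as "and" or "the" in this toy example, simply produce no termid.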

Figure 1
The Lassen POSTGRES Classes for Indexing and Their Linkages

The kw_term_doc_rel class provides a linkage
between a particular text item in any class or text
large object (we will refer to either as documents) and
a particular term from the wn_index class. The
POSTQUEL definition of this class is

create kw_term_doc_rel ( 

termid = int4, /* WordNet termid number */ 
synset = int4, /* WordNet sense number */ 
docid = int4, /* document ID */ 

termfreq = int4) /* term frequency within 
the document */ 

The raw frequency of occurrence of the term
in the document (termfreq) is included in the
kw_term_doc_rel tuple. This information is used in
the retrieval process for calculating the probability of
relevance for each document that contains the term.
The kw_doc_index class stores information on
individual documents in the database. This information
includes a unique document identifier (docid), the
location of the document (the class, the attribute, and
the tuple in which it is contained), and whether it is
a simple attribute or a large object (with effectively
unlimited size). The kw_doc_index class also
maintains additional statistical information, such as the
number of unique terms found in the document. The
POSTQUEL definition is as follows:



create kw_doc_index (
docid = int4,      /* document ID */
reloid = oid,      /* oid of relation containing it */
attroid = oid,     /* attribute definition of attr containing it */
attrnum = int2,    /* attribute number of attr containing it */
tupleid = oid,     /* tuple oid of tuple containing it */
sourcetype = int4, /* type of object -- attribute or large object */
doc_len = int4,    /* document length in words */
doc_ulen = int4)   /* number of unique words in document */



The kw_sources class contains information about
the classes and attributes indexed at the class level, as
well as statistics such as the number of items indexed
from any given class. The following POSTQUEL
statement defines this class:



create kw_sources (
relname = char16,    /* name of indexed relation */
reloid = oid,        /* oid of indexed relation */
attrname = char16,   /* name of indexed attribute */
attroid = oid,       /* object ID of indexed attribute */
attrnum = int2,      /* number of indexed attribute */
attrtype = int4,     /* attribute type -- large object or otherwise */
num_indexed = int4,  /* number of items indexed */
last_tid = oid,      /* oid and time for last */
last_time = abstime, /* tuple added */
tot_terms = int4,    /* total terms from all items */
tot_uterms = int4,   /* total unique terms from all items */
include_pat = text,  /* simple patterns to match */
exclude_pat = text)  /* for indexable items */



The other classes shown in Figure 1 relate to the
indexing and retrieval processes. The Lassen prototype
uses the POSTGRES rules system to perform such
tasks as storing the elements of the bibliographic
records in an appropriate normalized form and
triggering the indexing daemon.

Defining an attribute in the database as indexable
for information retrieval purposes (i.e., by appending
a new tuple to the kw_sources definition) creates a rule
that appends the class name and attribute name to the
kw_index_flags class whenever a new tuple is appended
to the class. Another rule then starts the indexing
process for the newly appended data. Figure 2 shows
this trigger process.

Digital Technical Journal Vol. 7 No. 3 1995

The indexing process extracts each unique keyword
from the indexed attributes of the database and stores
it along with pointers to its source document and its
frequency of occurrence in kw_term_doc_rel. This
process is shown in Figure 3. The indexing daemon
and the rules system maintain other global frequency
information. For example, the overall frequency of
occurrence of terms in the database and the total number
of indexed items are maintained for retrieval processing.
The indexing daemon attempts to perform
any outstanding indexing tasks before it shuts down. It
also updates the kw_doc_index tuple for a given indexable
class and attribute with a time stamp for the last
item indexed (last_tid and last_time). This permits
ongoing incremental indexing without having to
reindex older tuples.
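
The keyword extraction step can be sketched in a few lines of Python. This is an illustration only, not the Lassen daemon: the tokenizer, the tiny stopword list, and the in-memory term_ids dictionary are assumptions standing in for the wn_index lookup, and counting stopwords toward the document length is likewise an assumption. Each returned row mirrors the termid, docid, and termfreq fields of kw_term_doc_rel (synset is omitted).

```python
import re
from collections import Counter

# Hypothetical stopword list; the prototype's actual list is not given in the text.
STOPWORDS = {"the", "of", "and", "a", "in", "to", "is"}


def index_document(docid, text, term_ids):
    """Return (rows, doc_len, doc_ulen) for one document.

    rows holds one (termid, docid, termfreq) tuple per unique indexed term,
    analogous to kw_term_doc_rel. doc_len and doc_ulen are the word count and
    unique-word count, analogous to the fields of kw_doc_index.
    """
    words = re.findall(r"[a-z]+", text.lower())
    # Count only words that survive stopword removal and exist in the dictionary.
    counts = Counter(w for w in words if w not in STOPWORDS and w in term_ids)
    rows = [(term_ids[w], docid, freq) for w, freq in sorted(counts.items())]
    return rows, len(words), len(set(words))


# Stand-in for the wn_index dictionary (word -> termid).
term_ids = {"river": 1, "delta": 2, "water": 3}
rows, doc_len, doc_ulen = index_document(
    42, "The river delta stores water; the river floods.", term_ids)
```

Words missing from the dictionary are simply skipped here; how the prototype treats unknown words is not stated in the text.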

Retrieval 

The prototype version of Lassen provides ranked 
retrieval of the documents indexed by the indexing 
daemon using a probabilistic retrieval algorithm. This 
algorithm estimates the probability of relevance for 
each document based on statistical information on 
term usage in a user's natural language query and in 
the database. The algorithm used in the prototype is 
based on the staged logistic regression method. 8 

A POSTGRES user-defined function invokes ranked
retrieval processing. That is, from a user's perspective,
ranked retrieval is performed by a simple function
call (kwsearch) in a POSTQUEL query language
statement. Information from the classes created and
maintained by the indexing daemon is used to estimate
the probability of relevance for each indexed document.
(Note that the full power of the POSTQUEL
query language can also be used to perform conventional
Boolean retrieval using the classes created by the
indexing process and to combine the results of ranked
retrieval with other search criteria.) Figure 4 shows the
process involved in the probabilistic ranked retrieval
from the repository database.

Figure 2

The Lassen Indexing Trigger Process

Figure 3

The Lassen Indexing Daemon Process

Figure 4

The Lassen Retrieval Process

The actual query to the Lassen ranked retrieval
process consists simply of a natural language statement
of the searcher's interests. The query goes through the
same processing steps as documents in the indexing
process. The individual words of the query are
extracted and located in the wn_index dictionary
(after removing common words or "stopwords"). The
termids for matching words from wn_index are then
used to retrieve all the tuples in kw_term_doc_rel that
contain the term. For each unique document identifier
in this list of tuples, the matching kw_doc_index tuple
is retrieved. With the frequency information contained
in kw_term_doc_rel and kw_doc_index, the estimated
probability of relevance is calculated for each document
that contains at least one term in common with
the query. The formulae used in the calculation are
based on experiments with full-text retrieval. 8 The
basic equation for the probabilistic model used in
Lassen states the following: The probability, expressed
as log odds, of the event that a document is relevant, $R$,
given that there is a set of $N$ "clues" associated with
that document, $A_i$ for $i = 1, 2, \ldots, N$, is

$$\log O(R \mid A_1, \ldots, A_N) = \log O(R) + \sum_{i=1}^{N} \left[ \log O(R \mid A_i) - \log O(R) \right], \tag{1}$$



where for any events $E$ and $E'$ the odds $O(E \mid E')$ is
$P(E \mid E') / P(\bar{E} \mid E')$, i.e., a simple transformation of the
probabilities. Because there is not enough information
to compute the exact probability of relevance for any
user and any document, an estimation is derived based
on logistic regression of a set of clues (usually terms or
words) contained in some sample of queries and the
documents previously judged to be relevant to those
queries. For a set of $M$ terms that occur in both a query
and a given document, the regression equation is of
the form

$$\log O(R \mid A_1, \ldots, A_M) \approx c_0 + c_1 f(M) \sum_{m=1}^{M} X_{m,1} + \cdots + c_K f(M) \sum_{m=1}^{M} X_{m,K} + c_{K+1} M + c_{K+2} M^2, \tag{2}$$

where there are $K$ retrieval variables $X_{m,k}$ used to
characterize each term or clue, and the $c_i$ coefficients
are constant for a given training set of queries and
documents. The coefficients used in the prototype
were derived from analysis of full-text documents
and queries (with relevance judgments) from the
TIPSTER information retrieval test collection. 7 The
derivation of this formula is given in "Probabilistic
Retrieval Based on Staged Logistic Regression." The
full retrieval equation used for the prototype version of
retrieval described in this section is

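$$\log O(R \mid A_1, \ldots, A_M) \approx -3.51 + \frac{1}{\sqrt{M+1}} \left[ 37.4 \sum_{m=1}^{M} X_{m,1} + 0.330 \sum_{m=1}^{M} X_{m,2} - 0.1937 \sum_{m=1}^{M} X_{m,3} \right] + 0.0929\,M, \tag{3}$$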



where

$X_{m,1}$ is the quotient of the number of times the $m$th
term occurs in the query and the sum of the total
number of terms in the query plus 35;

$X_{m,2}$ is the logarithm of the quotient arrived at by
dividing the number of times the $m$th term occurs in
the document by the sum of the total number of terms
in the document plus 80;

$X_{m,3}$ is the logarithm of the quotient arrived at by
dividing the number of times the $m$th term occurs in
the database (i.e., in all documents) by the total number
of terms in the collection;

$M$ is the number of terms held in common by the
query and the document.

Note that the $M^2$ term called for in Equation 2 was
not found to provide any significant difference in the
results and was omitted from Equation 3. The constants
35 and 80, which were used in $X_{m,1}$ and $X_{m,2}$,
are arbitrary but appear to offer the best results when
set to the average size of a query and the average size
of a document for the particular database. The
sequence of operations performed to calculate the
probability of relevance is shown in Figure 5. Note
that in the figure, k1, ..., k5 represent the constants
of Equation 3.
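
As a rough illustration of Equation 3, the following Python sketch computes the log odds for one document and converts it to a probability. It is a sketch under stated assumptions rather than the prototype's code: the function names and the (qtf, dtf, ctf) input layout are hypothetical, natural logarithms are assumed, and the term-count normalization is assumed to be 1/sqrt(M + 1).

```python
import math

# The five constants of Equation 3, as printed in the text.
C0, C1, C2, C3, C4 = -3.51, 37.4, 0.330, -0.1937, 0.0929


def log_odds(matches, query_len, doc_len, collection_len):
    """Estimate log O(R | A_1..A_M) for one document.

    matches is a list with one (qtf, dtf, ctf) triple per term shared by the
    query and the document: the term's frequency in the query, in the
    document, and in the whole collection. All frequencies must be positive.
    """
    m = len(matches)
    if m == 0:
        raise ValueError("document shares no terms with the query")
    x1 = sum(qtf / (query_len + 35) for qtf, _, _ in matches)
    x2 = sum(math.log(dtf / (doc_len + 80)) for _, dtf, _ in matches)
    x3 = sum(math.log(ctf / collection_len) for _, _, ctf in matches)
    # Assumption: the normalization factor is 1 / sqrt(M + 1).
    return C0 + (C1 * x1 + C2 * x2 + C3 * x3) / math.sqrt(m + 1) + C4 * m


def probability(lo):
    """Convert log odds to a probability: p = 1 / (1 + exp(-log odds))."""
    return 1.0 / (1.0 + math.exp(-lo))
```

Because the coefficient on the collection-frequency variable is negative and that variable is the logarithm of a fraction, a term that is rarer in the collection raises the estimate, all else being equal.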

The probability of relevance is calculated for each
document (by converting the logarithmic odds to a
probability) and is stored along with a unique query
identifier, the document identifier, and some location
information in the kw_retrieval class. The query itself
and its unique identifier are stored in the kw_query
class. To see the results of the retrieval operation, the
query identifier is used to retrieve the appropriate
kw_retrieval tuples, ranked in order according to the
estimated probability of relevance. The kw_retrieval
and kw_query classes have the following POSTQUEL
definitions:



create kw_query (
query_id = int4,     /* ID number */
query_user = char16, /* POSTGRES user name */
query_text = text)   /* the actual query */

create kw_retrieval (
query_id = int4,        /* link to the query */
doc_id = int4,          /* document ID number */
rel_oid = oid,          /* location of doc */
attr_oid = oid,
attr_num = int2,
tuple_id = oid,
doc_len = int4,         /* size of document */
doc_match_terms = int4, /* number of query terms in the document */
doc_prob_rel = float8)  /* estimated probability of relevance */



The algorithm used for ranked retrieval in the
Lassen prototype was tested against a number of other
systems and algorithms as part of the TREC competition
and provided excellent retrieval performance. 10
We have found that the retrieval coefficients used in
the formula derived from analysis of the TIPSTER collection
appear to work well for a variety of document
types. In principle, the staged logistic regression
retrieval coefficients should be adapted to the particular
characteristics of the database by collecting relevance
judgments from actual users and reapplying the
staged logistic regression analysis to derive new coefficients.
This activity has not been performed for this
prototype implementation.

The primary contribution of the Lassen prototype
has been as a proof-of-concept for the integration of
full-text indexing and ranked retrieval operations in
a relational database management system. The prototype
implementation that we have described in this
section has a number of problems. For example, in the
prototype design for indexing and retrieval operations,
all the information used is visible in user-accessible
classes in the database. Also, the overhead is fairly
high, in terms of storage and processing time, for
maintaining the indexing and retrieval information in
this way. For example, POSTGRES allocates 40 bytes
of system information for each tuple in a class, and
indexing can take several seconds per document.

Currently, we are investigating a class of new access
methods to support indexing and retrieval in a more
efficient fashion. The class of methods involves declaring
some POSTGRES functions that can extract
subelements of a given type of attribute (such as words
in a text document) and generate indexes for each of
the subelements extracted. Other types of data might
also benefit from this class of access methods. For
example, functions that extract subelements like geometric
shapes from images might be used to generate
subelement indexes of image collections. Particular
index element extraction methods can be of great
value in providing access to the sort of information
stored in the Sequoia 2000 Electronic Repository. The
following section describes one such index extraction
method developed for the special needs of Sequoia
2000 data.

GIPSY: Automatic Georeferencing of Text 

Environmental Impact Reports (EIRs), journal articles,
technical reports, and myriad other text items
related to global change research that might be
included in the Sequoia 2000 database are examples of
a class of documents that discuss or refer to particular
places or regions. A common retrieval task is to find
the items that refer to or concentrate on a specific geographic
region. Although it is possible to have a
human catalog each item for location, one goal of the
Electronic Repository was to make all indexing and
retrieval automatic, thus eliminating the requirement
for human analysis and classification of documents in
the database. Therefore, part of our research involved
developing methods to perform automatic georeferencing
of text documents, that is, to automatically
index and retrieve a document according to the geographic
locations discussed or displayed in or otherwise
associated with its content.

In Lassen and most other full-text information
retrieval systems, searches with a geographical component,
such as "Find all documents whose contents pertain
to location X," are not supported directly by
indexing, query, or display functions. Instead, these
searches work only by references to named places,
essentially as side effects of keyword indexing. Whereas
human indexers are usually able to understand and
apply correct references to a document, the costs in
time and money of using geographically trained human
indexers to read and index the entire contents of a large
full-text collection are prohibitive. Even in cases where
a document is meticulously indexed manually, geographic
index terms consisting of keywords (text
strings) have several well-documented problems with
ambiguity, synonymy, and name changes over time. 11-13

Advantages of the GIPSY Model 

To deal with these problems, we developed a new
model for supporting geographically based access to
text. In this model, words and phrases that contain
geographic place names or geographic characteristics
are extracted from documents and used as input to
certain database functions. These functions use spatial
reasoning and statistical methods to approximate the
geographic position being referenced in the text. The
actual index terms assigned to a document are a set of
coordinate polygons that describe an area on the
earth's surface in a standard geographical projection
system. Using coordinates instead of names for the
place or geographic characteristic offers a number of
advantages.

■ Uniqueness. Place names are not unique, e.g., 
Venice, California, and Venice, Italy, are not appar- 
ently different without the qualifying larger region 
to differentiate them. Using coordinates removes 
this ambiguity. 

■ Immunity to spatial boundary changes. Political 
boundaries change over time, leading to confusion 
about the precise area being referred to. Coordi- 
nates do not depend on political boundaries. 

■ Immunity to name changes. Geographic names 
change over time, making it difficult for a user to 
retrieve all information that has been written about 
an area during any extended time period. Coordi- 
nates remove this ambiguity. 

■ Immunity to spatial, naming, and spelling varia- 
tion. Names and terms vary not only over time but 
also in contemporary usage. Geographic names 
vary in spelling over time and by language. Areas of 
interest to the user will often be given place names 
designated only in the context of a specific docu- 
ment or project. Such variations occur frequently 
for studies done in oceanic locations. Names associ- 
ated with these studies are unknown to most users. 
Coordinates are not subject to these kinds of verbal 
variations. 

Indexing texts and other objects (e.g., photographs,
videos, and remote sensing data sets) by coordinates
also permits the use of a graphical interface to the
information in the database, where representations of
the objects are plotted on a map. A map-based graphical
interface has several advantages over one that uses
text terms or one that simply uses numerical access to
coordinates. As Furnas suggests, humans use different
cognitive structures for graphical information than for
verbal information, and spatial queries cannot be fully
simulated by verbal queries. 14 Because many geographical
queries are inherently spatial, a graphical
model is more intuitive. This is supported by Morris's
observation that users given the choice between menu
and graphical interfaces to a geographic database preferred
the graphical mode. 15 A graphical interface,
such as a map, also allows for a dense presentation of
information. 16

To address the needs of global change scientists, the
Sequoia 2000 project team proposed a new browser
paradigm. 17 This system, called Tioga, displays information
topologically according to continuous characteristics
that are attributes of the data. 18 For example,
documents may be displayed on a map according to
their latitude and longitude. Documents may also be
displayed according to the time at which they were
generated and the time to which they refer, as well as
by more abstract functions such as the reading level of
the document and the author's attitudes as expressed
in the document. A prototype of the geographical
browsing component was included in the Lassen
Geographic Browser, which is shown in Figure 6.

This browser allows any georeferenced object in the
database to be indicated by an icon on the map. The
user employs the mouse to center the map on any
location and to zoom in or out for more or less map
detail. Icons can be made to appear at any coordinates
and for any range of magnification values. When an
icon is selected by the user, a menu of the objects
georeferenced at the icon coordinates and detail level is
displayed for selection.

An Algorithm to Georeference Text 

The advantages of georeferencing are apparent. Not so
apparent is how to perform such a task automatically.
We developed the following three-part thesaurus-based
algorithm to explore this task; the algorithm provides
the basis for georeferencing in GIPSY. 19

1. Identify geographic place names and phrases. This
step attempts to recognize all relevant content-bearing
geographic words and phrases. The parser
for this step must "understand" how to identify
geographic terminology of two types:

a. Terms that match objects or attributes in the
data set. This step requires a large thesaurus of
geographic names and terms, partially hand built
and partially automatically generated.

b. Lexical constructs that contain spatial information,
e.g., "adjacent to the coast," "south of the
delta," and "between the river and the highway."

To implement this part of the algorithm, a list of 
the most commonly occurring constructs must be 
created and integrated into a thesaurus. 

Figure 6

Screen from the Lassen Geographic Browser

2. Locate pertinent data. The output of the parser is
passed to a function that retrieves geographic coordinate
data pertinent to the extracted terms and
phrases. Spatially indexed data used in this step can
include, for example, name, size, and location of
cities and states; name and location of endangered
species; and name, location, and bioregional characteristics
of different climatic regions. The system
must identify the spatial locations that most closely
match the geographic terms extracted by the parser
and, when geographic modifiers are used, heuristically
modify the area of coverage. For example, the
phrase "south of Lake Tahoe" will map to the area
south of Lake Tahoe, covering approximately the
same volume. This spatial representation is, by
necessity, the result of an arbitrary assumption
of size, but its purpose is to provide only partial
evidence to be used in determining locations as
described below.

Since geopositional data for land use (e.g., cities,
schools, and industrial areas) and habitats (e.g.,
wetlands, rivers, forests, and indigenous species)
are also available, extracted keywords and phrases for
these types of data must be recognized. The thesaurus
entries for this data should incorporate several
other types of information, such as synonymy
(e.g., Latin and common names of species) and
membership (e.g., wetlands contain cattails, but
geopositional data on cattails may not exist, so we
must use their mention as weak evidence of a discussion
of wetlands and use that data instead).

For our implementation of GIPSY, we used two primary
data sets to construct the thesaurus. The first
was a subset of the United States Geological
Survey's Geographic Names Information System
(GNIS). This data set contains latitude/longitude
point coordinates associated with over 60,000 geographic
place names in California. To facilitate
comparison with other data sets, the GNIS
latitude/longitude coordinates were converted to
the Lambert-Azimuthal projection. Examples of
place names with associated points include

University of California Davis: -1867878 -471379

Redding: -1863339 -234894

Data for land use and habitats was derived from
the United States Geological Survey's Geographic
Information Retrieval and Analysis System
(GIRAS).

Each identified name, phrase, or region description
is associated with one or more polygons that may
be the place discussed in the text. Weights can be
assigned to each of these polygons based on the frequency
of use of its associated term or phrase in the
text being indexed and in the thesaurus. Many relevant
terms do not exactly match place names or the
feature and land use types listed above. For example,
alfalfa is a crop grown in California and should
be associated with the crop data from the GIRAS
land use data set. The thesaurus was therefore
extended, both manually and by extraction of
relationships from the WordNet thesaurus, to
include the following types of terms: 7

synonymy

= := synonym

kind-of relationships

~ := hyponym (maple is a ~ of tree)
@ := hypernym (tree is a @ of maple)

part-of relationships

# := meronym (finger is a # of hand)
% := holonym (hand is a % of finger)
& := evidonym (pine is a & of shortleaf pine)

3. Overlay polygons to estimate approximate locations.
The objective of this step is to combine the
evidence accumulated in the preceding step and
infer a set of polygons that provides a reasonable
approximation of the geographical locations mentioned
in the text. Each (geophrase, weight, polygon)
tuple can be represented as a three-dimensional
"extruded" polygon whose base is in the plane of
the x- and z-axes and whose height extends upward
on the y-axis a distance proportional to its weight
(see Figure 7a). As new polygons are added, several
cases may arise.

a. If the base of a polygon being added does not
intersect with the base of any other polygons, it
is simply laid on the base map beginning at y = 0
(see Figure 7b).

b. If the polygon being added is completely contained
within a polygon that already exists on the
geopositional skyline, it is laid on top of that
extruded polygon, i.e., its base plane is positioned
higher on the y-axis (see Figure 7c).

c. If the polygon being added intersects but is not
wholly contained by one or more polygons, the
polygon being added is split. The intersecting
portion is laid on top of the existing polygon and
the nonintersecting portion is positioned at a
lower level (i.e., at y = 0). To minimize fragmentation
in this case, polygons are sorted by size
prior to being positioned on the skyline (see
Figure 7d).

In effect, the extruded polygons, when laid
together, are "summed" by weight to form a geopositional
skyline whose peaks approximate the geographical
locations being referenced in the text. The
geographic coordinates assigned to the text segment
being indexed are determined by choosing a threshold
of elevation y in the skyline, taking the x-z plane at y,
and using the polygons at the selected elevation.
Raising the elevation of the threshold will tend to
increase the accuracy of the retrieval, whereas lowering
the elevation tends to include other similar regions.
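
The summing of extruded polygons can be illustrated with a deliberately simplified Python sketch. It assumes axis-aligned rectangles on an integer grid in place of true polygons (which would need a polygon-clipping library), so the names skyline and cells_above and the cell representation are hypothetical and not part of GIPSY. Each rectangle adds its weight to every grid cell it covers, and a threshold then selects the cells whose accumulated height is large enough.

```python
from collections import defaultdict


def skyline(weighted_rects):
    """Sum weights per grid cell.

    weighted_rects is an iterable of (weight, (x0, z0, x1, z1)) entries, where
    the rectangle covers the integer cells x0 <= x < x1 and z0 <= z < z1.
    Returns a dict mapping (x, z) cells to accumulated height.
    """
    height = defaultdict(float)
    for weight, (x0, z0, x1, z1) in weighted_rects:
        for x in range(x0, x1):
            for z in range(z0, z1):
                # Overlapping rectangles stack: their weights are summed.
                height[(x, z)] += weight
    return dict(height)


def cells_above(height, threshold):
    """Return the set of cells whose accumulated height meets the threshold."""
    return {cell for cell, h in height.items() if h >= threshold}


# Two overlapping regions: the overlap accumulates both weights.
rects = [(1.0, (0, 0, 4, 4)), (2.0, (2, 2, 6, 6))]
heights = skyline(rects)
peak = cells_above(heights, 3.0)
```

Raising the threshold shrinks the selected area toward the peaks, and lowering it admits cells supported by weaker evidence, which mirrors the trade-off described above.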



To see the results of this process in the GIPSY prototype,
consider the following text from a publication of
the California Department of Water Resources:

The proposed project is the construction of a new
State Water Project (SWP) facility, the Coastal Branch,
Phase II, by the Department of Water Resources
(DWR) and a local distribution facility, the Mission
Hills Extension, by water purveyors of northern Santa
Barbara County. This proposed buried pipeline
would deliver 25,000 acre-feet per year (AF/YR) of
SWP water to San Luis Obispo County Flood Control
and Water Conservation District (SLOCFCWCD) and
27,723 AF/YR to Santa Barbara County Flood Control
and Water Conservation District (SBCFCWCD). . . .
This extension would serve the South Coast and
Upper Santa Ynez Valley. DWR and the Santa Barbara
Water Purveyors Agency are jointly producing an
EIR for the Santa Ynez Extension. The Santa
Ynez Extension Draft EIR is scheduled for release in
spring 1991.

The resulting surface plot appears in Figure 8. The
figure contains a gridded representation of the state of
California, which is elevated to distinguish it from the
base of the grid. The northern part of the state is on
the left-hand side of the image. The towers rising over
the state's shape represent polygons in the skyline
generated by GIPSY's interpretation of the text. The
largest towers occur in the area referred to by the text,
primarily centered on Santa Barbara County, San Luis
Obispo, and the Santa Ynez Valley area.

The surface plots generated in this fashion can also 
be used for browsing and retrieval. For example, the 
two-dimensional base of a polygon with a thickness 
above a certain threshold can be assigned as a coordi- 
nate index to a document. These two-dimensional 
polygons might then be displayed as icons on a map 
browser such as the one shown in Figure 6. 

Future Work 

Research remains to be done on several extensions to
the existing GIPSY implementation. Because a geographic
knowledge base and spatial reasoning are fundamental
to the georeferencing process, they have
been the focus of initial research efforts.

The existing prototype can be complemented by 
the addition of more sophisticated natural language 
processing. For example, spatial reasoning and geo- 
graphic data could be combined with parsing tech- 
niques to develop semantic representations of the 
text. Adjacency indicators, such as "south of" or 
"between," should be recognized by a parser. Also, 
the work on document segmentation described below 
could be used to explore the locality of reference to 
geographic entities within full-text documents. 
GIPSY's technique may be most effective when 
applied to a paragraph or section level of a text instead 
of to the entire document.

(a) The "weight" of a polygon, indicated by the
vertical arrow, is interpreted as "thickness."

(b) Two adjacent polygons do not affect each other;
each is merely assigned its appropriate "thickness."

(c) When one polygon subsumes another, their
"thicknesses" in the area of overlap are summed.

(d) When two polygons intersect, their "thicknesses"
are summed in the area of overlap.

Figure 7

Overlaying Polygons to Estimate Approximate Locations

Figure 8

Surface Plot Produced from the State Water Project Text



TextTiling: Enhancing Retrieval through 
Automatic Subtopic Identification 

Full-length documents have only recently become
available on-line in large quantities, although technical
abstracts, short newswire texts, and legal documents
have been accessible for many years. The large majority
of on-line information has been bibliographic (e.g.,
authors, titles, and abstracts) instead of the full text of
the document. For this reason, most information
retrieval methods are better suited for accessing
abstracts than for accessing longer documents. Part of
the repository research was an exploration of new
approaches to information retrieval particularly suited
to full-length texts, such as those expected in the
Sequoia 2000 database.

A problem with applying traditional information
retrieval methods to full-length text documents is that
the structure of full-length documents is quite different
from that of abstracts. (In this paper, "full-length
document" refers to expository text of any length.
Typical examples are a short magazine article and
a 50-page technical report. We exclude documents
composed of headlines, short advertisements, and any
other disjointed texts of whatever length. We also
assume that the document does not have detailed
orthographically marked structure. Croft, Krovetz,
and Turtle describe work that takes advantage of this
kind of information.) One way to view an expository
text is as a sequence of subtopics set against a backdrop
of one or two main topics. A long text comprises many
different subtopics that may be related to one another
and to the backdrop in many different ways. The main
topics of a text are discussed in its abstract, if one
exists, but subtopics are usually not mentioned.
Therefore, instead of querying against the entire
content of a document, a user should be able to issue a
query about a coherent subpart, or subtopic, of a
full-length document, and that subtopic should be
specifiable with respect to the document's main topic(s).

Consider a Discover magazine article about the 
Magellan space probe's exploration of Venus. 25 
A reader divided this 23-paragraph article into the fol- 
lowing segments with the labels shown, where the 
numbers indicate paragraph numbers: 

1-2    Intro to Magellan space probe
3-4    Intro to Venus
5-7    Lack of craters
8-11   Evidence of volcanic action
12-15  River Styx
16-18  Crustal spreading
19-21  Recent volcanism
22-23  Future of Magellan

Assume that the topic of volcanic activity is of 
interest to a user. Crucial to a system's decision to 
retrieve this document is the knowledge that a dense 
discussion of volcanic activity, rather than a passing ref- 
erence, appears. Since volcanism is not one of the 
text's two main topics, the number of references to 
this term will probably not dominate the statistics of 
term frequency. On the other hand, document selec- 
tion should not necessarily be based on the number of 
references to the target terms. 

The goal should be to determine whether or not 
a relevant discussion of a concept or topic appears. 
A simple approach to distinguishing between a true 
discussion and a passing reference is to determine the 
locality of the references. In the computer science 
operating systems literature, locality refers to the fact 
that over time, memory access patterns tend to con- 
centrate in localized clusters rather than be distributed 
evenly throughout memory. Similarly, in full-length 
texts, the close proximity of members of a set of 






references to a particular concept is a good indicator of
topicality. For example, the term volcanism occurs 5
times in the Magellan article, the first four instances of
which occur in four adjacent paragraphs, along with
accompanying discussion. In contrast, the term scientists,
which is not a valid subtopic, occurs 13 times,
distributed somewhat evenly throughout. By its very
nature, a subtopic will not be discussed throughout an
entire text. Similarly, true subtopics are not indicated
by only passing references. The term belly dancer
occurs only once, and its related terms are confined to
the one sentence it appears in. As its usage is only
a passing reference, belly dancing is not a true subtopic
of this text.

Our solution to the problem of retaining valid 
subtopical discussions while at the same time avoid- 
ing being fooled by passing references is to make 
use of locality information and to partition docu- 
ments according to their subtopical structure. This 
approach's capacity for improving a standard information
retrieval task has been verified by information
retrieval experiments using full-text test collections 
from the TIPSTER database/"- 17 

One way to get an approximation of the subtopic 
structure is to break the document into paragraphs, or 
for very long documents, sections. In both cases, this 
entails using the orthographic marking supplied by the 
author to determine topic boundaries. 

Another way to approximate local structure in long
documents is to divide the documents into even-sized
pieces, without regard for any boundaries. This
approach is not practical, however, because we are
interested in exploring the performance of motivated
segmentation, i.e., segmentation that reflects the
text's true underlying subtopic structure, which often
spans paragraph boundaries.

Toward this end, we have developed TextTiling,
a method for partitioning full-length text documents
into coherent multiparagraph units called tiles.
TextTiling approximates the subtopic structure of
a document by using patterns of lexical connectivity to
find coherent subdiscussions. The layout of the tiles is
meant to reflect the pattern of subtopics contained in
an expository text. The approach uses quantitative lexical
analyses to determine the extent of the tiles and to
classify them with respect to a general knowledge base.
The tiles have been found to correspond well to
human judgments of the major subtopic boundaries of
science magazine articles.

The algorithm is a two-step process. First, all pairs of
adjacent blocks of text (where blocks are usually three
to five sentences long) are compared and assigned
a similarity value. Second, the resulting sequence of
similarity values, after being graphed and smoothed, is
examined for peaks and valleys. High similarity values,
which imply that the adjacent blocks cohere well, tend
to form peaks, whereas low similarity values, which
indicate a potential boundary between tiles, create valleys.
Figure 9 shows such a graph for the Discover
magazine article mentioned earlier. The vertical lines
indicate where human judges thought the topic
boundaries should be placed. The graph shows the
computed similarity of adjacent blocks of text. Peaks
indicate coherency, and valleys indicate potential
breaks between tiles.

The one adjustable parameter is the size of the block 
used for comparison. This value, k, varies slightly from
text to text. As a heuristic, it is assigned the average 
paragraph length (in sentences), although the block 
size that best matches the human judgment data is 
sometimes one sentence greater or smaller. Actual 
paragraphs are not used because their lengths can be 
highly irregular, leading to unbalanced comparisons. 
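
The block-size heuristic can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `choose_block_size` and `adjacent_block_pairs` are hypothetical, sentences are assumed to be split already, and comparing the k sentences on either side of every sentence gap is only one possible reading of "adjacent blocks."

```python
def choose_block_size(paragraphs):
    """Heuristic k: the average paragraph length in sentences (at least 1).

    `paragraphs` is a list of paragraphs, each given as a list of sentences.
    """
    total = sum(len(p) for p in paragraphs)
    return max(1, round(total / len(paragraphs)))


def adjacent_block_pairs(sentences, k):
    """Yield (gap, left_block, right_block) for every sentence gap.

    Assumption: a block is the k sentences before or after a gap; blocks
    near the ends of the document are simply shorter.
    """
    for gap in range(1, len(sentences)):
        left = sentences[max(0, gap - k):gap]
        right = sentences[gap:gap + k]
        yield gap, left, right


# Toy document: three paragraphs of 2, 3, and 4 sentences.
paragraphs = [["s0", "s1"], ["s2", "s3", "s4"], ["s5", "s6", "s7", "s8"]]
sentences = [s for p in paragraphs for s in p]
k = choose_block_size(paragraphs)
pairs = list(adjacent_block_pairs(sentences, k))
```

Because the blocks are cut by sentence count rather than by the author's paragraph marks, each comparison involves units of roughly equal size, which is the point of not using actual paragraphs.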

Similarity is measured by using a variation of the
tf.idf (term frequency times inverse document frequency)
measurement. In standard tf.idf, terms that
are frequent in an individual document but relatively
infrequent throughout the corpus are considered to
be good distinguishers of the contents of the individual
document. In TextTiling, each block of k sentences
is treated as a unit, and the frequency of a term
within each block is compared to its frequency in the
entire document. (Note that the algorithm uses a large
stop list; i.e., closed class words and other very frequent
terms are omitted from the calculation.) This
approach helps bring out a distinction between local
and global extent of terms. A term that is discussed frequently
within a localized cluster (thus indicating
a cohesive passage) will be weighted more heavily than
a term that appears frequently but scattered evenly
throughout the entire document, or infrequently
within one block. Thus if adjacent blocks share many
terms, and those shared terms are weighted heavily,
there is strong evidence that the adjacent blocks
cohere with one another.

Figure 9

Results of TextTiling a 77-sentence Science Article

Similarity between blocks is calculated by the following
cosine measure: Given two text blocks $b_1$ and $b_2$,

$$\cos(b_1, b_2) = \frac{\sum_{t} w_{t,b_1}\, w_{t,b_2}}{\sqrt{\sum_{t} w_{t,b_1}^{2}\, \sum_{t} w_{t,b_2}^{2}}}$$

where $t$ ranges over all the terms in the document, and
$w_{t,b_1}$ is the tf.idf weight assigned to term $t$ in block $b_1$.
Thus, if the similarity score between two blocks is
high, then not only do the blocks have terms in common,
but the common terms are relatively rare with
respect to the rest of the document. The evidence in
the reverse is not as conclusive. If adjacent blocks have
a low similarity measure, this does not necessarily
mean that the blocks do not cohere. In practice, however, this
negative evidence is often justified.
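
A small Python sketch of the cosine comparison follows. The text does not give the exact form of the block-level tf.idf variant, so the weighting used here (term count in the block times the log of the number of blocks divided by the number of blocks containing the term) is an assumption, as are the function names and the toy blocks. Stop-list filtering is assumed to have happened already.

```python
import math
from collections import Counter


def block_weights(blocks):
    """Assumed tf.idf variant: weight = tf_in_block * log(N / blocks_with_term)."""
    n_blocks = len(blocks)
    block_freq = Counter()
    for block in blocks:
        block_freq.update(set(block))
    weights = []
    for block in blocks:
        tf = Counter(block)
        weights.append(
            {t: tf[t] * math.log(n_blocks / block_freq[t]) for t in tf}
        )
    return weights


def cosine(w1, w2):
    """Cosine of two sparse weight vectors stored as dicts."""
    numerator = sum(w1[t] * w2[t] for t in w1.keys() & w2.keys())
    denominator = math.sqrt(
        sum(v * v for v in w1.values()) * sum(v * v for v in w2.values())
    )
    return numerator / denominator if denominator else 0.0


# Hypothetical blocks of already-filtered terms.
blocks = [
    ["venus", "lava", "lava"],
    ["lava", "crater"],
    ["probe", "orbit"],
    ["orbit", "probe", "radar"],
]
weights = block_weights(blocks)
sims = [cosine(weights[i], weights[i + 1]) for i in range(len(weights) - 1)]
```

In this toy input the middle pair of blocks shares no terms, so its similarity is zero and marks the most likely boundary; the outer pairs share locally concentrated terms and score higher.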

The graph is then smoothed using a discrete convolution
of the similarity function with the function
$h_k(\cdot)$, where

$$h_k(i) = \begin{cases} \dfrac{1}{k^{2}}\,(k - |i|), & |i| \le k - 1 \\ 0, & \text{otherwise.} \end{cases}$$

The result is smoothed further with a simple median
smoothing algorithm to eliminate small local minima.
Tile boundaries are determined by locating the
lowermost portions of valleys in the resulting plot.
The actual values of the similarity measures are not
taken into account; the relative differences are what
are of consequence.
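
The smoothing and boundary-finding steps might be sketched as below. Several details are assumptions made for illustration rather than taken from the paper: the kernel is taken to be triangular and normalized to sum to 1, it is renormalized at the ends of the sequence, the median filter has width 3 and leaves the endpoints alone, and a flat-bottomed valley is reported at its midpoint. All function names are hypothetical.

```python
from statistics import median


def triangular_kernel(k):
    """Assumed kernel: (k - |i|) / k**2 for |i| <= k - 1."""
    return [(k - abs(i)) / (k * k) for i in range(-(k - 1), k)]


def smooth(sims, k):
    """Discrete convolution with the kernel, renormalized near the ends."""
    kernel = triangular_kernel(k)
    half = k - 1
    out = []
    for n in range(len(sims)):
        acc = norm = 0.0
        for j, w in enumerate(kernel):
            idx = n + j - half
            if 0 <= idx < len(sims):
                acc += w * sims[idx]
                norm += w
        out.append(acc / norm)
    return out


def median_smooth(values, width=3):
    """Replace each interior value by the median of its neighborhood."""
    half = width // 2
    out = list(values)
    for i in range(half, len(values) - half):
        out[i] = median(values[i - half:i + half + 1])
    return out


def find_valleys(values):
    """Indices of local minima; only relative differences are used."""
    valleys = []
    i, n = 1, len(values)
    while i < n - 1:
        if values[i] < values[i - 1]:
            j = i
            while j + 1 < n and values[j + 1] == values[j]:
                j += 1  # walk across a flat valley floor
            if j + 1 < n and values[j + 1] > values[j]:
                valleys.append((i + j) // 2)
            i = j + 1
        else:
            i += 1
    return valleys


# Hypothetical similarity sequence: one deep valley and one shallow dip.
raw = [0.875, 0.75, 0.25, 0.125, 0.25, 0.75, 0.875, 0.75, 0.875]
boundaries = find_valleys(median_smooth(smooth(raw, k=2)))
```

On the raw sequence the valley finder reports both the deep valley and the shallow dip; after smoothing only the deep valley remains as a candidate tile boundary.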

Retrieval processing should reflect the assumption
that full-length text is meaningfully different in structure
from abstracts and short articles. We have conducted
retrieval experiments that demonstrate that
taking text structure into account can produce better
results than using full-length documents in the standard
way. By working within this paradigm, we have
developed an approach to vector-space-based retrieval
that appears to work better than retrieving against entire
documents or against segments or paragraphs alone.

The resulting retrieval method matches a query 
against motivated segments and then sums the scores 
from the top segments for each document. The high- 
est resulting sums indicate which documents should 
be retrieved. In our test set, this method produced
higher precision and recall than retrieving against
entire documents or against segments or paragraphs
alone. Although the vector-space model of retrieval
was used for these experiments, probabilistic models 
such as the one used in Lassen are equally applicable, 
and the method should provide similar improvement 
in retrieval performance. 
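
The scoring rule described above can be illustrated with a short Python sketch. The segment scores are assumed to come from some query-to-segment similarity function that is not shown, and the number of top segments to sum (`top_n`) is a free parameter chosen here only for illustration; the text does not state the value used in the experiments.

```python
import heapq
from collections import defaultdict


def rank_documents(segment_scores, top_n=2):
    """Rank documents by the sum of their top_n segment scores.

    `segment_scores` is an iterable of (doc_id, score) pairs, one pair
    per motivated segment that was matched against the query.
    """
    per_doc = defaultdict(list)
    for doc_id, score in segment_scores:
        per_doc[doc_id].append(score)
    totals = {
        doc_id: sum(heapq.nlargest(top_n, scores))
        for doc_id, scores in per_doc.items()
    }
    ranking = sorted(totals, key=totals.get, reverse=True)
    return ranking, totals


# Hypothetical scores for the segments of three documents.
scores = [
    ("a", 5), ("a", 4), ("a", 1),
    ("b", 3), ("b", 3), ("b", 3), ("b", 3),
    ("c", 8),
]
ranking, totals = rank_documents(scores, top_n=2)
```

Document "b" has the most matching segments, but because only its best two are summed it ranks below "a" (two strong segments) and "c" (one very strong segment).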

We believe that recognizing the structure of full-length
text for the purposes of information retrieval
is very important and will produce considerable
improvement in retrieval effectiveness over most existing
similarity-based techniques.

Conclusion 

The Sequoia 2000 Electronic Repository project has
provided a test bed for developing and evaluating technologies
required for effective and efficient access to
the digital libraries of the future. We can expect that as 
digital libraries proliferate and include vast databases of 
information linked together by high-bandwidth net- 
works, they must support all current and future media 
in an easily accessible and content-addressable fashion. 

The work begun on the Sequoia 2000 Electronic 
Repository is continuing under UC Berkeley's digital 
library project sponsored jointly by the National 
Science Foundation (NSF), the National Aeronautics 
and Space Administration (NASA), and the Defense 
Advanced Research Projects Agency (DARPA).
Digital libraries are a fledgling technology with no 
firm standards, architectures, or even consensus 
notions of what they are and how they are to work. 
Our goal in this ongoing research is to develop the 
means of placing the contents of this developing 
global virtual library at the fingertips of a worldwide 
clientele. Achieving this goal will require the application
of advanced techniques for information retrieval,
information filtering, and resource discovery, as well as
new techniques for automatically analyzing
and characterizing data sources ranging from
texts to videos. Much of the work needed to enable
our vision of these new technologies was pioneered in
the Sequoia 2000 Electronic Repository project.

Tecate: A Software Platform for Browsing and Visualizing Data from Networked Data Sources

Tecate is a new infrastructure on which applica- 
tions can be constructed that allow end users 
to browse for and then visualize data within 
networked data sources. This software platform 
capitalizes on the architectural strengths of cur- 
rent scientific visualization systems, network 
browsers like Netscape, database management 
system front ends, and virtual reality systems. 
Applications layered on top of Tecate are able 
to browse for information in databases man- 
aged by database management systems and for 
information contained in the World Wide Web. 
In addition, Tecate dynamically crafts user inter- 
faces and interactive visualizations of selected 
data sets with the aid of an intelligent system. 
This system automatically maps many kinds of 
data sets into a virtual world that can be explored 
directly by end users. In describing these virtual 
worlds, Tecate uses an interpretive language that 
is also capable of performing arbitrary compu- 
tations and mediating communications among 
different processes. 



All people share the need to find and assimilate information.
Data from which information is created is
increasingly available electronically, and that data
is becoming more and more accessible with the proliferation
of computer networks. Therefore, the world
is quickly becoming abstracted as a collection of networked
data spaces, where a data space is a data source
or repository whose access is controlled by means of
a well-defined software interface. Some examples
of data spaces are a database managed by a database
management system, the World Wide Web (WWW or
Web), and any data object that resides in a computer's
main memory and whose components are accessible
through the object's methods.

The need to locate data and then map it to a form
that is readily understood lies at the core of learning,
conducting commerce, and being entertained. To
address this need, interactive tools are required for
exploring data spaces. These tools should allow any
end user to browse the contents of data spaces and to
inspect, measure, compare, and identify patterns in
selected data sets. Combining both tasks into one tool
is both elegant and utile in that end users need to learn
only one system to seamlessly switch back and forth
between browsing for data and assimilating it. Before
such applications can be constructed, however, a firm 
foundation must be defined that provides an interface 
to data spaces, helps map data into a visual representa- 
tion, and manages user interactions with elements in 
the visualizations. 

This paper describes one such software platform,
called Tecate, which has been implemented as a
research prototype to help understand the issues
involved in exploring data spaces. With Tecate, the
emphasis has been on developing the tools needed to
build end-to-end applications. Such applications can
access data spaces, automatically create virtual worlds
that represent data found in data spaces, and give end
users the ability to navigate and interact with those
worlds as the mechanism for exploring data spaces.
Because of this emphasis, Tecate's development concentrated
on understanding what system components
are needed to create end-to-end applications and how
those components interact rather than on the functionality
of individual components. As a consequence,
the tools provided by Tecate can be used to build
applications of only modest capabilities.

Historically, Tecate grew out of the Sequoia 2000 
project, which was initiated jointly by Digital Equip- 
ment Corporation and the University of California 
in 1991. The primary purpose of the Sequoia 2000 
project was to develop information systems that would 
allow earth scientists to better study global environmental
change. Sequoia 2000 participants needed
to browse for data sets on which to test scientific
hypotheses and then to interactively visualize the data
sets once found. The data can be quite varied in content
and structure, ranging from text and images
to time-varying, multidimensional, gridded or polyhedral
data sets. Such data may stream from many different
sources, e.g., databases managed by a database
management system, a running simulation of some 
physical process, or the WWW. Therefore, a tool was 
required that could interface to any such source. To be 
of maximum use, though, the tool had to be easy 
to use so that the scientists themselves could make 
sophisticated data queries and then experiment with 
the query results using a wide variety of data visualiza- 
tion techniques. 

Generalizing from its Sequoia 2000 roots, the 
design of Tecate is intended to achieve four goals: 

1. Interface to general data spaces wherever they may
reside.

2. Saliently visualize most kinds of data, e.g., scientific
data and the listings in a telephone book.

3. Dynamically craft user interfaces and interactive 
visualizations based on what data is selected, who is 
doing the visualizing, and why the user is exploring 
the data. 

4. Allow end users to interact with elements in visual- 
izations as a means to query data spaces, to explore 
alternate ways of presenting information, and to 
make annotations. 

There are systems available today that have some of
these capabilities, but no one system possesses all four.
Data visualization systems such as AVS, Khoros, or
Data Explorer are capable of visualizing scientific data;
however, they are poor at interfacing to general data
spaces, they provide only limited interactivity within
visualizations themselves, and they require visualizations
to be crafted by hand by knowledgeable end
users. 1-3 Network browsers such as Netscape are good
at fetching data from certain types of data spaces but
are limited in the variety of data they can directly visualize
without having to rely on external viewer programs.
Moreover, most network browsers offer a
restricted type of interactivity where only hyperlinks
can be followed and text can be submitted through
forms. Finally, front ends to database management
systems provide elaborate querying mechanisms for
selecting data from a database, but they lack a sophisticated
means for visualizing and further exploring
query results.

The Tecate architecture borrows from that of visu- 
alization systems, network browsers, and database 
management systems as well as from virtual reality sys- 
tems like Alice and the Minimal Reality Toolkit/ 
Object Modeling Language (MR/OML). 45 One 
major contribution of the Tecate system is that it 
incorporates the architectural strengths of these 
systems into a coherent whole. In addition, Tecate 
possesses at least two novel features that are not found 
in other data visualization systems. One feature is 
Tecate's use of an interpretive language that can 
describe three-dimensional (3-D) virtual worlds. This 
language is more than a markup language in that it is 
capable of performing arbitrary computations and 
facilitating communication among different processes. 
The second novel component of Tecate is the presence 
of an expert system that automatically crafts interactive 
visualizations of data. This system is intended to make 
data space exploration easier to perform by having end 
users simply state their goals while leaving the details 
of implementing a visualization to attain those goals to 
the expert system. 

The remainder of the paper outlines Tecate's sys- 
tem model and architecture and then identifies and 
describes Tecate's major components. Finally, the 
paper sketches Tecate's capabilities by discussing two 
simple applications that have been implemented on top 
of the Tecate software framework. The first application 
is a tool for visualizing earth science data residing in 
a database managed by a database management system. 
The second application is a Web browser that uses 3-D 
graphics as an underlying browsing paradigm rather 
than depending solely on the medium of hypertext. 

Tecate's System Model 

After presenting an overview of Tecate's system model, 
this section provides details of the object model and the 
interpretive, object-oriented language used to describe 
virtual world objects. 

Overview 

From the standpoint of an applications programmer, 
Tecate is a distributed, object-oriented system. All 
major components of Tecate, as well as entities appear- 
ing in virtual worlds created by Tecate, are objects that 
communicate with one another by means of message 
passing. The main focus within Tecate is on object- 
object interactions. These interactions occur primarily 
when objects send messages to one another. An object 
can also send a message to itself, which has the effect of 
making a local function call. Unlike with graphics 
systems such as Open Inventor, rendering is not a central
activity within Tecate; rather it is just a side effect
of object-object interactions. 6 In this sense, Tecate is
like virtual reality programming systems such as Alice 
and MR/OML, although Tecate is far more flexible. 

In the Tecate system, objects can create and destroy 
other objects and can alter the properties of existing 
objects on the fly. Such capabilities make Tecate very
extensible and give it great power and flexibility. These
capabilities can also cause problems for applications
programmers, however, if care is not taken when writing
programs. Presently, all of an object's properties
are visible to all other objects, and hence those properties
can be manipulated from outside the object. In the
future, some form of selective property hiding needs
to be added so that designated properties of an object 
cannot be altered by other objects. 

A powerful feature of Tecate is its ability to dynami- 
cally establish object-subobject relationships. This fea- 
ture provides a mechanism for building assemblies of 
parts similar to the mechanisms in classical hierarchical 
graphics systems like Doré or Open Inventor. 7 This
feature also provides the capability of creating sets or 
aggregates of objects that share some trait, such as 
being highlighted. Tecate allows all objects within a set 
to be treated en masse by providing a means of selec- 
tively broadcasting messages to groups of objects. 
A message that is sent to an object can be forwarded 
to all the object's subobjects. Thus, for example, one 
object can serve as a container for all other objects that 
are highlighted; the highlighted objects are merely subobjects
of the container. To unhighlight all highlighted
objects, a single unhighlight message can be sent to the 
container object, which then forwards the message to 
all its subobjects. In general, an object can be the sub- 
object of any number of other objects and thus simulta- 
neously be a member of many different sets. 
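
The container idea can be illustrated with a small Python sketch. This is not Tecate's implementation or AVL syntax: the class `WorldObject`, its `send` method, and the `forward` flag are hypothetical stand-ins for message passing with selective forwarding to subobjects.

```python
class WorldObject:
    """A toy object with subobjects and message forwarding."""

    def __init__(self, name):
        self.name = name
        self.subobjects = []
        self.highlighted = False

    def add_subobject(self, obj):
        self.subobjects.append(obj)

    def send(self, message, forward=False):
        # Invoke the matching behavior, if this object has one.
        handler = getattr(self, "on_" + message, None)
        if handler is not None:
            handler()
        # Optionally forward the same message to every subobject.
        if forward:
            for sub in self.subobjects:
                sub.send(message, forward=True)

    def on_highlight(self):
        self.highlighted = True

    def on_unhighlight(self):
        self.highlighted = False


# One object can belong to an assembly and to a set at the same time.
car = WorldObject("car")
wheel = WorldObject("wheel")
icon = WorldObject("icon")
car.add_subobject(wheel)

highlighted_set = WorldObject("highlightedSet")
for obj in (wheel, icon):
    obj.send("highlight")
    highlighted_set.add_subobject(obj)

# A single message to the container unhighlights every member.
highlighted_set.send("unhighlight", forward=True)
```

The sketch does not guard against cycles in the subobject graph; a real system that lets any object be a subobject of any other would need some protection against forwarding a message around a loop.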

The handling of user input within Tecate is 
intended to appear the same as ordinary object-object 
interactions. All physical input devices that are known 
to Tecate have an agent object associated with them 
that acts as a device handler. All objects that wish to 
be informed of a particular input event register with 
the appropriate agent. When an input event occurs, 
the agent sends all registered objects a message notify- 
ing them of the event. Complex events, such as the 
occurrence of event A and event B within a specified 
time period, can easily be defined by creating new han- 
dler objects. These handlers register to be informed of 
separate events but then, in turn, inform other objects 
of the events' conjunction. 
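
The registration scheme and a composite handler might look roughly like the following Python sketch. The class names (`Agent`, `Conjunction`, `Recorder`), the event names, and the use of explicit timestamps are all assumptions made for illustration; they are not part of Tecate.

```python
from collections import defaultdict


class Agent:
    """Device handler: objects register for events and get notified."""

    def __init__(self):
        self._registered = defaultdict(list)

    def register(self, event, target):
        self._registered[event].append(target)

    def notify(self, event, timestamp):
        for target in self._registered[event]:
            target.receive(event, timestamp)


class Conjunction(Agent):
    """Handler that reports when two events occur within a time window."""

    def __init__(self, first, second, window, name):
        super().__init__()
        self.first, self.second = first, second
        self.window = window
        self.name = name
        self._last = {}

    def receive(self, event, timestamp):
        self._last[event] = timestamp
        if self.first in self._last and self.second in self._last:
            gap = abs(self._last[self.first] - self._last[self.second])
            if gap <= self.window:
                self.notify(self.name, timestamp)
                self._last.clear()


class Recorder:
    """Stand-in for any object that wants to hear about events."""

    def __init__(self):
        self.log = []

    def receive(self, event, timestamp):
        self.log.append((event, timestamp))


mouse = Agent()
chord = Conjunction("Button-1", "Button-2", window=0.5, name="chord")
mouse.register("Button-1", chord)
mouse.register("Button-2", chord)

listener = Recorder()
chord.register("chord", listener)

mouse.notify("Button-1", 1.0)
mouse.notify("Button-2", 1.25)   # within the window: "chord" fires
mouse.notify("Button-1", 5.0)
mouse.notify("Button-2", 9.0)    # too far apart: nothing fires
```

Because the composite handler is itself an agent, objects interested in the conjunction register with it exactly as they would with a physical device handler.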

The Object Model 

Tecate uses an object model that makes no distinction
between classes and instances, unlike
languages such as C++. 8 In Tecate, there is a single object
creation operation called cloning. Any object in the
system can serve as a prototype from which a copy can
be made through the clone operation. A clone inherits
the properties of its prototype, but any such property can be altered
or removed, either by another object or by the clone
itself, so that a clone can take on an identity of its own.

The object model is based on delegation. When 
Tecate clones an object to produce a new object, 
the prototype's properties are not explicitly copied. 
Instead, the new object retains a reference to the 
object from which it was cloned. When a reference to 
a property is made within an object, the system looks 
for the property value locally within the object. If no
property value is found locally, then the object's prototype
is searched to associate a value with the reference.
If the prototype is itself a clone, the prototype's
prototype is recursively searched to resolve the refer- 
ence, and so on. This type of "lazy" evaluation of 
property references is called delegation. 

Note that with delegation, a change in value for
a property in an object may affect the values of all
other objects that can trace their ancestry through
prototype-clone relationships to the original object.
This type of semantics is useful for establishing
class-instance-like relationships between objects. For example,
one object may represent a particular class of
automobile tire, and all clones of the object would 
represent class instances. If a class-level change is 
needed that would affect all instances, e.g., a new tread 
pattern is to be introduced, only the object represent- 
ing the tire class needs to change. 

The clone-prototype chaining implied by delega- 
tion can be overridden by changing the property 
values locally. Thus, if one particular tire instance is 
to have a new tread pattern, then the pattern is altered 
in that instance only. References to the tread pattern 
for that object will use the local tread value rather than 
chain back to the tire class object. All other instances 
will continue to reference the value present in the tire 
class object. 
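
Delegation with local overriding can be sketched in a few lines of Python. The class `ProtoObject` and its methods are hypothetical and only mimic the lookup rule described above; the tread-pattern values are made up for the tire example.

```python
class ProtoObject:
    """Prototype-based object: properties are looked up by delegation."""

    def __init__(self, name, prototype=None):
        self.name = name
        self.prototype = prototype
        self.properties = {}   # only locally defined properties

    def clone(self, name):
        # Nothing is copied; the clone just remembers its prototype.
        return ProtoObject(name, prototype=self)

    def set(self, key, value):
        self.properties[key] = value

    def which(self, key):
        """Return the object that actually defines `key`."""
        obj = self
        while obj is not None:
            if key in obj.properties:
                return obj
            obj = obj.prototype
        raise KeyError(key)

    def get(self, key):
        return self.which(key).properties[key]


tire = ProtoObject("Tire")
tire.set("tread", "ribbed")

front = tire.clone("front")
rear = tire.clone("rear")
before = front.get("tread")          # resolved in the prototype

tire.set("tread", "all-season")      # class-level change reaches all clones
rear.set("tread", "studded")         # local override affects one clone only
```

After the two changes, `front` still delegates to the tire object and sees the new class-level value, while `rear` resolves the property locally.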

All Tecate objects possess four classes of properties: 

1. Appearance — attributes that affect an object's 
visual appearance, such as geometric and topologi- 
cal structure, color, texture, and material properties 

2. Behaviors — a set of methods that are invoked upon 
receipt of messages from other objects 

3. State — a collection of variables whose values repre- 
sent an object's state 

4. Subobjects — a list of objects that are parts of a 
given object, just as a wheel is part of a car 

Although most users of the system uniformly see 
communicating objects, a distinction is actually made 
between two kinds of objects based on how they are 
implemented by applications programmers. Resource 
objects are implemented primarily as external processes 
using some compilable, general-purpose program- 
ming language such as C or Fortran. Objects that have 
compute-intensive behaviors or whose behavior executions
are time-critical are generally implemented as
resource objects. For instance, most Tecate objects that 
provide system services, such as rendering or database 
management, are implemented as resource objects. 

Objects populating virtual worlds that represent 
data features are implemented differently than 
resource objects by using an interpretive program- 
ming language called the Abstract Visualization Lan- 
guage (AVL). Such objects are called dynamic objects 
because they may be created, destroyed, and altered 
on the fly as a Tecate session unfolds. Nonetheless,
the ability to dynamically add, remove, and alter object
properties is not exclusive to dynamic objects.
Resource properties may also be changed on the fly
because resources are actually implemented with a 
dynamic object that interfaces to the portion of the 
resource that is implemented as an external process. 

The Abstract Visualization Language 

AVL is essential to the Tecate system; it is through AVL
that applications programmers write applications that
use Tecate's features. 9 AVL is an interpretive, object-oriented
programming language that is capable of
performing arbitrary computations and facilitating
communication among different processes. Through
this language, applications programmers specify and 
manipulate object properties and invoke object behav- 
iors by sending messages from one object to another. 

AVL is a typeless language that manipulates character
strings; it is based on the Tcl embeddable
command language. 10 AVL extends Tcl by adding
object-oriented programming support, 3-D graphics,
and a more sophisticated event-handling mechanism.
Although AVL is a proper superset of Tcl, the relationship
between AVL and Tcl is much like that between
C and C++. Because AVL adds a small set of new constructs
to Tcl, the way applications programmers structure
AVL programs differs markedly from how they structure
Tcl programs, just as the C++ language extensions
to C greatly alter the C programming style.

One use of AVL is to describe virtual worlds that
represent data sets. Through AVL, objects that populate
these worlds can be assigned behaviors that are
elicited through user interaction. For instance, selecting
a 3-D icon can cause a Universal Resource Locator
(URL) to be followed out into the WWW. In this
sense, AVL is somewhat like the Hypertext Markup
Language (HTML) that underlies all Web browsers
today, or, more fittingly, it is similar to the Virtual
Reality Modeling Language (VRML) that has been
proposed as a 3-D analog of HTML. 11 AVL does, however,
differ markedly from HTML and VRML, which
are only markup languages. Because AVL is a full-fledged
programming language that has sophisticated
interaction handling built in, it is philosophically more
similar to interpretive languages like Telescript,
NewtonScript, and PostScript. 12-14 Like Telescript,
for instance, AVL programs can encode "smart
agents" that can be sent across a network to perform
user tasks at a remote machine, if an AVL interpreter
resides there. Note, however, that in the present version
of Tecate, there is no notion of security when
arbitrary AVL code runs on a remote machine.

AVL includes some additional commands that augment
the Tcl instruction set, for instance, clone and
delete. The clone command is the object creation command
within AVL, and the delete command is the complementary
operation to delete objects from the
system. Object properties are specified and manipulated
using the add command and deleted using the
remove command. Behaviors in one object are initiated
by another object using the send command, which
specifies the behavior to invoke and the arguments to
be passed. Queries about object properties can be
made using the inquire command. The which command
is used to determine where an object's properties
are actually defined in light of Tecate's use of delegation
to resolve property references. Finally, AVL provides
a rich set of matrix and vector operators that are
useful when positioning objects within 3-D scenes.

As an example of how AVL is used in practice,
Figure 1 depicts a code fragment similar to one that
appears in the WWW application described later in the
paper. The code fragment creates a 3-D Web site icon
that is positioned on a world map. The code begins
with the definition of the Hyperlink object from which
all Web site icons are cloned. The Hyperlink object is
itself cloned from the Visual object that is predefined
by Tecate at system start-up. The Visual object contains
properties that relate to the viewing of objects
within scenes. For instance, objects that are cloned
from the Visual object inherit behaviors to rotate
themselves and to change their color. To the properties
that are inherited from the Visual object, the
Hyperlink object adds the state variables url and desc,
which will be used to store respectively a URL and its
textual description. In addition, objects cloned from
the Hyperlink object will inherit the default appearance
of a solid blue sphere having unit radius.

The specification for the Hyperlink object also
defines three behaviors: init, openUrl, and showDesc.
The init behavior replaces the init method inherited
from the Visual object. When an object cloned from
the Hyperlink object receives an init message, it sets its
url and desc state variables, positions itself within the
scene whose name is given by the argument scene, and
registers itself with the mouse handler agent to receive
two events. When mouse button 1 is depressed, the
agent sends the object the openUrl message, which in
turn requests the WWW Interface to fetch the data
pointed to by the object's URL. Depressing button 2
invokes the showDesc message, causing the Web site
URL and description to be displayed by a previously

# Define a prototype for all Web icons
clone Hyperlink Visual

add Hyperlink {
    state {
        url ""
        desc ""
    }
    appearance {
        shape {sphere}
        diffuseColor {0.0 0.0 1.0}
        repType {surface}
    }
    behavior {
        # Initialize hyperlink
        init {url desc pos scene window} {
            addstate url $url
            addstate desc $desc
            send [getself] move "add $pos"
            add $scene "subobject [getself]"
            send $window addEvent "[getself] {Button-1 {openUrl {}}} {Button-2 {showDesc {}}}"
        }
        # Open the URL
        openUrl {} {send www fetch "[getstate url]"}
        # Display the description
        showDesc {} {send metaViewer display "[getstate desc]"}
    }
}

# Initialize an informational landscape
clone scene Visual
clone window Viewer
send window init {scene}

# Create a Web site icon
clone hlink Hyperlink
send hlink init {"http://www.sdsc.edu/Home.html"
        "SDSC home page" "-2.3 -2.0 1.0" scene window}

# Use the SDSC model geometry
add hlink {appearance {shape {box}}}



Figure 1 

An Implementation of a World Wide Web Icon in the Abstract Visualization Language 



defined interface widget called the metaViewer. The
AVL command getself, which is used within the init
behavior body, returns the name of the object on
which the behavior was called, thus allowing applications
programmers to write generic behaviors. The
other AVL commands, getstate and addstate, are
shorthand for "get [getself] state ..." and "add [getself]
{state ...}."

Once the Hyperlink object is defined, a scene, a dis- 
play window, and a Web site icon are created. The 
Tecate scene object is cloned from the Visual object. 
The window object, cloned from the predefined 
Viewer object, is the viewport into which the scene is 
to be rendered. Finally, hlink is a Web site icon whose
appearance differs from that which is inherited from
the Hyperlink object. Rather than being spherical, the
shape of the hlink icon is a unit cube.
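
The prototype-and-clone mechanism that Figure 1 relies on can be illustrated outside AVL. The following Python sketch is an illustration only and is not Tecate's implementation: the class `ProtoObject` and its methods are hypothetical names, and the explicit `self` argument passed to each behavior merely plays the role that getself plays in AVL.

```python
class ProtoObject:
    """A prototype-based object: entries are looked up locally first,
    then along the chain of prototypes the object was cloned from."""

    def __init__(self, name, prototype=None):
        self.name = name
        self.prototype = prototype
        self.slots = {"state": {}, "appearance": {}, "behavior": {}}

    def clone(self, name):
        # A clone starts empty and inherits everything from its prototype.
        return ProtoObject(name, prototype=self)

    def add(self, kind, **entries):
        # Local additions shadow inherited values; the prototype is unchanged.
        self.slots[kind].update(entries)

    def get(self, kind, key):
        obj = self
        while obj is not None:
            if key in obj.slots[kind]:
                return obj.slots[kind][key]
            obj = obj.prototype
        raise KeyError(f"{self.name} has no {kind} entry {key!r}")

    def send(self, message, *args):
        # The receiver is passed explicitly, so inherited behaviors stay generic.
        return self.get("behavior", message)(self, *args)


visual = ProtoObject("Visual")
hyperlink = visual.clone("Hyperlink")
hyperlink.add("state", url="", desc="")
hyperlink.add("appearance", shape="sphere")
hyperlink.add("behavior", showDesc=lambda self: self.get("state", "desc"))

hlink = hyperlink.clone("hlink")
hlink.add("state", desc="SDSC home page")
hlink.add("appearance", shape="box")
```

In this sketch the clone overrides only its shape and description; the showDesc behavior and the empty url are still found on the Hyperlink prototype.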



Tecate's Architecture 

The general structure of Tecate and how it relates to 
application programs is depicted in Figure 2. Tecate 
consists of a kernel, a set of basic system services, and a 
toolkit of predefined objects. The Tecate kernel, which 
is shown in Figure 3, is an object management system 
called the Abstract Visualization Machine; AVL is its 
native language. The Abstract Visualization Machine 
is responsible for creating, destroying, altering, ren- 
dering, and mediating communication between 
objects. The two major components of the Abstract 
Visualization Machine are the Object Manager and 
the Rendering Engine. 

The Object Manager is the primary component of
the Abstract Visualization Machine. It is responsible for
interpreting AVL programs, managing a database of
objects, mediating communication between objects,
and interfacing with input devices. The Object
Manager is itself a resource object that is distinguished
by the fact that all other resource objects are spawned
from this one object. In addition, the Object Manager
is responsible for creating a distinguished dynamic
object, called Root, from which all other dynamic
objects can trace their heritage through prototype-clone
relationships.

Figure 2

The Tecate System and Its Relationship to Application
Programs

The Object Manager is implemented on a simple,
custom-built thread package. Each object within
Tecate can be thought of as a process that has its own
thread of control. Each thread can be implemented
either as a lightweight process that shares the same
machine context as the Object Manager's operating
system process or as its own operating system process
separate from that of the Object Manager. Lightweight
processes are so named because their use requires little
system overhead, which enables thousands of such
processes to be active at any given time. Within Tecate,
dynamic objects are implemented as lightweight
processes, whereas resource objects are implemented
as heavyweight operating system processes, which may
or may not be paired with a lightweight, adjunct
process. A low-level function library is provided to
handle the creation and destruction of threads and
to handle interthread communication regardless of
how the threads are implemented.
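
The idea of giving each object its own thread of control and a uniform way to exchange messages can be sketched as follows. This is a minimal illustration under stated assumptions, not Tecate's thread package: Python's standard `threading` and `queue` modules stand in for the custom library, and the `ObjectThread` class and its `send` method are hypothetical.

```python
import queue
import threading


class ObjectThread(threading.Thread):
    """One object with its own thread of control and a message mailbox."""

    def __init__(self, name, behaviors):
        super().__init__(name=name, daemon=True)
        self.mailbox = queue.Queue()
        self.behaviors = behaviors  # message name -> function

    def send(self, message, *args):
        # Callable from any thread; the reply arrives on the returned queue.
        reply = queue.Queue(maxsize=1)
        self.mailbox.put((message, args, reply))
        return reply

    def run(self):
        while True:
            message, args, reply = self.mailbox.get()
            if message == "destroy":
                reply.put(None)
                break
            reply.put(self.behaviors[message](*args))


counter = ObjectThread("counter", {"double": lambda x: 2 * x})
counter.start()
result = counter.send("double", 21).get(timeout=5)
counter.send("destroy").get(timeout=5)
counter.join(timeout=5)
```

Because callers see only `send`, the same interface could in principle front a separate operating system process instead of a thread, which is the kind of uniformity the paragraph above describes.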

Closely allied with the Object Manager is the Ren- 
dering Engine, which is a special resource object 
wholly contained within the Abstract Visualization 
Machine. The Rendering Engine is responsible for 
creating a graphical rendition of a virtual world that is 
specified by AVL programs interpreted by the Object 
Manager. When interpreting an AVL program, the 
Object Manager strips off appearance attributes of 
objects and sends appropriate messages to the Ren- 
dering Engine so that it can maintain a separate display 
list that represents a virtual world. Display lists are rep- 
resented as directed, acyclic graphs whose connectivity 
is determined by object-subobject relationships that 
are specified within AVL programs. 
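
A display list organized as a directed, acyclic graph of object-subobject links can be sketched as below. The sketch is illustrative only; `DisplayNode`, `add_subobject`, and `traverse` are hypothetical names, and the object names are borrowed from the car example that appears later in this paper.

```python
class DisplayNode:
    """One node of a display list; children are its subobjects."""

    def __init__(self, name):
        self.name = name
        self.children = []


def reaches(start, target):
    """Return True if target is reachable from start via subobject links."""
    stack, seen = [start], set()
    while stack:
        node = stack.pop()
        if node is target:
            return True
        if id(node) not in seen:
            seen.add(id(node))
            stack.extend(node.children)
    return False


def add_subobject(parent, child):
    # Refuse any link that would let a node become its own descendant.
    if reaches(child, parent):
        raise ValueError(f"{child.name} -> {parent.name} would create a cycle")
    parent.children.append(child)


def traverse(node):
    """Yield node names depth-first, in the order a renderer might visit them."""
    yield node.name
    for child in node.children:
        yield from traverse(child)


car, car_body, wheels, back_right = (
    DisplayNode(n) for n in ("car", "car_body", "wheels", "back_right")
)
add_subobject(car, car_body)
add_subobject(car, wheels)
add_subobject(wheels, back_right)
order = list(traverse(car))
```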

The present Rendering Engine implementation 
uses the Dore graphics package running on a DEC 
3000 Model 500 workstation. 7 The display lists that are 
created by invoking behaviors within the Rendering 
Engine are actually built up and maintained through 
Dore. The set of messages that the Rendering Engine 
responds to represents an interface to a platform's 
graphics hardware that is independent of both the 
graphics package and the display device. 

Layered on top of the Abstract Visualization 
Machine are Tecate's system services and the object 
toolkit. The system services consist of a collection of 
resource objects that are automatically instantiated at 
system start-up. These resources include an expert 
system called the Intelligent Visualization System, 
the Database Interface, the WWW Interface, and a
visualization programming system called BigRiver.
Figure 3 shows these resources in relationship to
Tecate's kernel. Each resource is a Tecate object that
has a number of predefined behaviors that can be use- 
ful to applications programmers. For instance, the 
WWW Interface has a behavior that fetches a data file 
referred to by a URL and then translates the file's con- 
tents into an appropriate AVL program. 

The toolkit within Tecate is a set of predefined 
dynamic objects that programmers can use to develop 
applications. These objects are considered abstract 
objects in the sense that they are not intended to be 
used directly. Rather, they serve as prototypes from 
which clones can be created. The toolkit consists of 
objects such as viewports, lights, and cameras that are 
used to illuminate and render virtual worlds. The 
toolkit also contains a modest collection of 3-D user 
interface widgets that can be used within virtual 
worlds created by an applications programmer. These 
widgets include sliders, menus, icons, legends, and
coordinate axes.

One useful object in the toolkit that aids in simulating
physical processes and helps in performing animations
is a clock. This object is an event generator that
signals every clock tick. If objects wish to be informed
of a clock pulse, those objects register themselves with
the clock object just as objects register themselves
with input device agent objects. The default clock
object can be cloned, and each clone can be instanti- 
ated with a different clock period down to a resolution 
of one millisecond. Any number of clocks can be tick- 
ing simultaneously during a Tecate session. Since new 
clocks can be created dynamically, and objects can reg- 
ister and unregister to be informed of clock pulses 
on-the-fly, clocks can be used as timers and triggers, 
and as pacesetters. 
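
The register-and-tick pattern described above can be sketched in a few lines. This is an assumption-laden illustration, not Tecate's clock object: the `Clock` class is hypothetical, and it advances simulated time through an explicit `advance` call rather than a real timer so that the example is deterministic.

```python
class Clock:
    """An event generator that signals registered listeners on every tick."""

    def __init__(self, period_ms=1000):
        if period_ms < 1:
            raise ValueError("period must be at least one millisecond")
        self.period_ms = period_ms
        self.listeners = []
        self.now_ms = 0
        self._since_tick = 0

    def clone(self, period_ms):
        # Each clone can run with its own period.
        return Clock(period_ms)

    def register(self, listener):
        if listener not in self.listeners:
            self.listeners.append(listener)

    def unregister(self, listener):
        if listener in self.listeners:
            self.listeners.remove(listener)

    def advance(self, elapsed_ms):
        """Advance simulated time, signalling one tick per elapsed period."""
        self._since_tick += elapsed_ms
        while self._since_tick >= self.period_ms:
            self._since_tick -= self.period_ms
            self.now_ms += self.period_ms
            # Copy the list so listeners may unregister during a tick.
            for listener in list(self.listeners):
                listener(self.now_ms)


ticks = []
record = ticks.append
fast = Clock().clone(period_ms=10)
fast.register(record)
fast.advance(35)       # ticks at 10, 20, and 30 ms of simulated time
fast.unregister(record)
fast.advance(100)      # no listener remains, so nothing is recorded
```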

Application Resources 

Tecate's system services are predefined application 
resources that aid in interactively visualizing data. As 
mentioned previously, these objects include the Intel- 
ligent Visualization System, the Database Interface, 
the WWW Interface, and the BigRiver visualization 
programming system. In addition, an applications pro- 
grammer can easily add new application resources 
using tools provided with the base Tecate system. Such 
new resources can be built around either user-written 
programs or commercial off-the-shelf applications. 
To create a new application resource, a programmer 
needs to provide a set of functions that can be invoked 
by other Tecate objects. These functions correspond 
to behaviors that are called when the resource receives
a message from other objects. Tools are provided
to register the behaviors with Tecate and to manage
the communication between a resource and other 
Tecate objects." 
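
Wrapping a user-written function as an application resource might look like the following sketch. The names `Resource`, `MessageRouter`, and `mean` are hypothetical and are not part of Tecate's tools; the router only stands in for whatever mediates messages between objects.

```python
class Resource:
    """A named resource whose behaviors are ordinary functions."""

    def __init__(self, name):
        self.name = name
        self._behaviors = {}

    def register(self, message, function):
        # Associate a message name with the function that handles it.
        self._behaviors[message] = function

    def receive(self, message, *args):
        if message not in self._behaviors:
            raise LookupError(f"{self.name} has no behavior named {message!r}")
        return self._behaviors[message](*args)


class MessageRouter:
    """Delivers messages to resources by name (a stand-in for the kernel)."""

    def __init__(self):
        self._resources = {}

    def add(self, resource):
        self._resources[resource.name] = resource

    def send(self, target, message, *args):
        return self._resources[target].receive(message, *args)


def mean(values):
    # A user-written function that becomes a behavior once registered.
    return sum(values) / len(values)


router = MessageRouter()
stats = Resource("stats")
stats.register("mean", mean)
router.add(stats)
answer = router.send("stats", "mean", [280.0, 290.0])
```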



The Intelligent Visualization System 

The Intelligent Visualization System allows Tecate to 
dynamically build interactive visualizations and user 
interfaces that aid nonexpert end users in exploring 
data spaces. This knowledge-based system is similar in 
concept to other expert visualization systems, as the 
literature describes."" 21 The Intelligent Visualization 
System differs from other expert visualization systems 
in two important ways. First, the Intelligent Visualiza- 
tion System does not merely create a presentation of 
information as do most other systems. Instead, the 
Intelligent Visualization System creates virtual worlds 
with which end users can interact to alter the way data 
is presented, to make queries for additional data, and 
to store new data back into data spaces. 

The second way the Intelligent Visualization System
differs from other expert visualization systems is that it
takes a holistic approach to fashioning a visualization. Most
systems decompose data into elementary components,
determine how to visualize each component separately, 
and then recompose the individual visualizations into 
a final presentation. In contrast, Tecate's Intelligent 
Visualization System analyzes the full structure of data
by relying on a sophisticated data model based on the 
mathematical notion of fiber bundles." 24 One way to 
view fiber bundles is as a generalization of the concept 
of graphs of mathematical functions. Depending on 
the character of a fiber bundle's independent and 
dependent variables, certain visualization techniques 
are more applicable than others. 
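
To make the "graph of a function" view concrete, the sketch below stores samples as a mapping from points in the space of independent variables to tuples of dependent values. It is a simplified illustration and not Tecate's data model: the `SampledBundle` class is hypothetical, and the climate sample values are made up.

```python
from dataclasses import dataclass, field


@dataclass
class SampledBundle:
    """Samples of a function from independent to dependent variables."""

    base_names: tuple    # independent variables
    fiber_names: tuple   # dependent variables
    samples: dict = field(default_factory=dict)  # base point -> fiber value

    def add(self, base_point, fiber_value):
        if (len(base_point) != len(self.base_names)
                or len(fiber_value) != len(self.fiber_names)):
            raise ValueError("sample does not match the bundle's description")
        self.samples[tuple(base_point)] = tuple(fiber_value)

    def describe(self):
        # A self-description that a technique selector could key on.
        return {
            "base_dim": len(self.base_names),
            "fiber_dim": len(self.fiber_names),
            "count": len(self.samples),
        }


# The ordinary graph of y = x * x: one independent, one dependent variable.
parabola = SampledBundle(("x",), ("y",))
for x in range(-2, 3):
    parabola.add((x,), (x * x,))

# A climate-style data set: three independent, three dependent variables.
climate = SampledBundle(
    ("latitude", "longitude", "elevation"),
    ("cloud density", "water content", "temperature"),
)
climate.add((32.7, -117.2, 0.0), (0.1, 0.4, 288.0))  # made-up values
```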

In general, the Intelligent Visualization System 
automatically crafts virtual worlds based on a task spec- 
ification and a description of the data that is to be visu- 
alized. A task specification represents a high-level data 
analysis goal of what an end user hopes to understand 
from the data. For instance, an end user may wish to 
determine if there is any correlation between tempera- 
ture and the density of liquid water in a climatology 
data set. Usually, task specifications must be input by 
an end user, although at times they can be inferred 
automatically by the system. Tecate provides a simple 
task language from which task specifications can be 
built, and it provides a point-and-click tool for end
users to create these specifications when needed. Data 
descriptions, on the other hand, do not require any 
end-user input because they are provided automati- 
cally by a data-space interface when data is imported 
into the system. 

From the data description and task specification, 
a Planner within the Intelligent Visualization System 
produces a dataflow program that when executed 
builds an appropriate virtual world that represents 
a selected data set. The Planner uses a collection of 
rules, definitions, and relationships that are stored in 
a knowledge base when building a visualization 
that addresses a given task specification. Contents of 
the knowledge base include knowledge about data
models, user tasks, and visualization techniques. The
Planner functions by constructing a sentence within a 
dataflow language defined by a context-sensitive graph 
grammar. At each step in the construction of the sen- 
tence, rules in the knowledge base dictate which pro- 
ductions in the grammar are to be applied and when. 
Presently, the knowledge base is implemented using 
the Classic knowledge representation system; the 
Planner is implemented in CLOS.-'^ 6 
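
The rule-driven construction of a dataflow program can be approximated by a much simpler mechanism than a context-sensitive graph grammar. The sketch below applies productions to a linear script until none applies; it is an illustration only, and the `plan` function, the production rules, and the module names are all hypothetical rather than taken from the Planner's knowledge base.

```python
def plan(task, description, productions):
    """Apply the first applicable production repeatedly until none applies."""
    script = ["read"]
    while True:
        for applies, module in productions:
            # Each module is added at most once, which guarantees termination.
            if module not in script and applies(script, task, description):
                script.append(module)
                break
        else:
            return script


# Each production pairs a condition on (script, task, description)
# with the module it appends when the condition holds.
productions = [
    (lambda s, t, d: s[-1] == "read" and d["gridded"]
        and t["goal"] == "spatial variation", "isosurface"),
    (lambda s, t, d: s[-1] == "read" and not d["gridded"], "sphere_glyphs"),
    (lambda s, t, d: s[-1] in ("isosurface", "sphere_glyphs"), "color_legend"),
    (lambda s, t, d: s[-1] == "color_legend", "display"),
]

climate_script = plan({"goal": "spatial variation"}, {"gridded": True}, productions)
site_script = plan({"goal": "correlation"}, {"gridded": False}, productions)
```

Here the task and the data description together determine which branch is taken, which is the essential point; a real graph grammar would build a graph of modules rather than a list.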

BigRiver 

The dataflow program produced by the Intelligent 
Visualization System is written in a scripting language 
that is interpreted by BigRiver, a visualization pro- 
gramming system similar to AVS and Khoros.' 3 From 
a technical standpoint, BigRiver is not particularly 
innovative and will eventually be reimplemented using 
some existing visualization system that has more func- 
tionality. The reason that BigRiver was created from 
scratch was to better understand how existing visual- 
ization programming systems work and to overcome 
limitations within those systems. These limitations are 
their inability to be embedded within other applica- 
tions, their lack of comprehensive data models, and 
their inability to work with user-supplied renderers.
The latest generation of visualization programming 
systems, such as Data Explorer and AVS/Express, 
overcome many of these limitations/ : 

Like most of the existing visualization systems, 
BigRiver consists of a collection of procedures called 
modules, each of which has a well-defined set of inputs 
and outputs. Functional specifications for these modules
represent some of the knowledge contained in the
Intelligent Visualization System's knowledge base. 
Visualization scripts that are interpreted by BigRiver 
specify module parameter values and dictate how the 
outputs of chosen modules are to be channeled into 
the inputs of others. 

BigRiver modules come in three varieties: I/O, data 
manipulators, and glyph generators. All modules use 
self-describing data formats based on fiber bundles. 
One format is used for manipulation within memory;
the other is an on-the-wire encoding intended for
transporting data across a network. An input module
is responsible for converting data stored in the on-the-wire
encoding into the in-memory format. The data
manipulator modules transform fiber bundles of one
in-memory format into those of another. The glyph
generators take as input fiber bundles in the in-memory
format and produce AVL programs that when executed
build virtual worlds containing objects that represent 
features of selected data sets. A single display module 
takes as input AVL code and passes it to the Abstract 
Visualization Machine. By means of the Rendering 
Engine, the Abstract Visualization Machine uses the 
appearance attributes of objects to create an image of 
a virtual world that contains the objects. 
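
The chain of input module, data manipulator, and glyph generator can be sketched as a small pipeline. Everything here is an assumption made for illustration: JSON stands in for the on-the-wire encoding, a plain dictionary stands in for the in-memory format, the function names are hypothetical, and the emitted text is only patterned on the AVL syntax of Figure 1.

```python
import json


def input_module(wire_bytes):
    """Decode the on-the-wire form into the in-memory form (here a dict)."""
    return json.loads(wire_bytes.decode("utf-8"))


def scale_module(bundle, factor):
    """A data manipulator: return a new bundle with scaled positions."""
    scaled = [[c * factor for c in point] for point in bundle["points"]]
    return {**bundle, "points": scaled}


def sphere_glyphs(bundle):
    """A glyph generator: emit AVL-style text, one sphere per data point."""
    lines = []
    for i, (x, y, z) in enumerate(bundle["points"]):
        name = f"site{i}"
        lines.append(f"clone {name} Visual")
        lines.append(f"add {name} {{appearance {{shape {{sphere}}}}}}")
        lines.append(f'send {name} move "add {x} {y} {z}"')
    return "\n".join(lines)


def run_script(wire_bytes, script):
    """Channel each module's output into the next module's input."""
    data = input_module(wire_bytes)
    for module, params in script:
        data = module(data, **params)
    return data


wire = json.dumps({"points": [[1, 2, 3]]}).encode("utf-8")
script = [(scale_module, {"factor": 2}), (sphere_glyphs, {})]
avl_text = run_script(wire, script)
```

In the system described above, a display module would then hand the generated text to the Abstract Visualization Machine; the sketch stops at producing the text.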



The Database Interface 

The Database Interface provides the means to interact
with a database management system, which in the current
version of Tecate can be either POSTGRES or
Illustra.28,29 Database queries, written in POSTQUEL
for POSTGRES-managed databases or in SQL for 
Illustra databases, are sent to the Database Interface by 
Tecate objects where they are passed to a database 
management system server for execution. The server 
returns the query results to the Database Interface, 
which then attempts to package them up as an on-the- 
wire encoding of a fiber bundle buffered on local disk. 
If the result is a set of tuples in the standard format 
returned by POSTGRES or Illustra, the Database 
Interface performs the fiber bundle translation. For 
most other nonstandard results, the so-called binary 
large objects (BLOBs) of the database realm, the 
Database Interface cannot yet arbitrarily perform the
translation into the on-the-wire fiber bundle encoding.
The only BLOBs that the Database Interface can
deal with presently are those that are already encoded
as on-the-wire fiber bundles. The difficult problem of
automated data format translation was not addressed 
during Tecate's initial development, although the 
intent is to address this issue in the future. 
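
Packaging a set of tuples as a self-describing encoding can be sketched as follows. This is not the Database Interface's actual translation: SQLite (through Python's standard `sqlite3` module) stands in for POSTGRES or Illustra, JSON stands in for the on-the-wire fiber bundle encoding, and the function `tuples_to_wire`, the table, and the values are hypothetical.

```python
import json
import sqlite3


def tuples_to_wire(cursor, base_columns):
    """Encode a query result as a self-describing byte string.

    Columns named in base_columns are treated as independent variables;
    all remaining columns are treated as dependent variables.
    """
    names = [column[0] for column in cursor.description]
    missing = [name for name in base_columns if name not in names]
    if missing:
        raise ValueError(f"result has no columns named {missing}")
    payload = {
        "base": list(base_columns),
        "fiber": [name for name in names if name not in base_columns],
        "columns": names,
        "tuples": [list(row) for row in cursor.fetchall()],
    }
    return json.dumps(payload).encode("utf-8")


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sites (lat REAL, lon REAL, temp REAL)")
conn.execute("INSERT INTO sites VALUES (?, ?, ?)", (32.7, -117.2, 15.5))
cursor = conn.execute("SELECT lat, lon, temp FROM sites")
wire = tuples_to_wire(cursor, ("lat", "lon"))
decoded = json.loads(wire)
conn.close()
```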

Once query results are buffered on disk, a descrip- 
tion of the fiber bundle and the location of the buffer 
are sent back to the object that made the query 
request of the Database Interface. That object might 
then request the Intelligent Visualization System to 
structure a virtual world whose image would appear 
on the display screen by way of BigRiver and the 
Rendering Engine. Objects in the virtual world can be 
given behaviors that are elicited by user interactions. 
These behaviors might then result in further database 
queries and so on. Chains of events such as these provide
a means for browsing databases through direct
manipulation of objects within a virtual world. 

The World Wide Web Interface 

The WWW Interface functions similarly to the 
Database Interface but instead of accessing data in 
a database, the WWW Interface provides access to data 
stored on the World Wide Web. Messages that contain 
URLs are passed to the WWW Interface, which then 
fetches the data pointed to by the URLs. In retrieving 
data from the Web, the WWW Interface uses the same 
CERN software libraries used by Web browsers like 
Netscape. 

Once a data file is fetched, the WWW Interface 
attempts to translate its contents into an AVL pro- 
gram, which is then passed to the Object Manager for 
interpretation. AVL either specifies the creation of 
a new virtual world that represents the data file's con- 
tents or specifies new objects that are to populate the 
current world being viewed. If the fetched data file 
contains a stream of AVL code, the WWW Interface
merely forwards the file to the Object Manager. If the
file contains general data in the form of an on-the-wire
encoding of a fiber bundle, the WWW Interface
appeals to the Intelligent Visualization System to
structure an appropriate virtual world. If the data file
contains a stream of HTML code, the WWW Interface
invokes an internal translator that translates HTML
code into an equivalent AVL program, which is then
interpreted by the Object Manager. This translator
actually understands an extended version of HTML
that supports the direct embedding of AVL within 
HTML documents. Through this mechanism, 3-D 
objects with which users can interact can be embedded 
directly into a hypertext Web page — something that 
few if any other Web browsers can do today. 
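
The three-way handling of fetched data can be sketched as a small dispatcher. The sketch rests on several assumptions: the kind of content is taken as already known and labeled with a hypothetical string, the planning step is collapsed into a single callable that returns AVL text, and the HTML branch only pulls out embedded AVL sections rather than translating the whole page. None of the function names come from Tecate.

```python
import re


def extract_embedded_avl(html):
    """Return the AVL fragments found between <AVL> and </AVL> tags."""
    pattern = r"<AVL>(.*?)</AVL>"
    found = re.findall(pattern, html, flags=re.DOTALL | re.IGNORECASE)
    return [fragment.strip() for fragment in found]


def dispatch(kind, body, interpret_avl, plan_visualization):
    """Route fetched content according to its kind."""
    if kind == "avl":
        # AVL is forwarded unchanged.
        return interpret_avl(body)
    if kind == "fiber-bundle":
        # General data is first turned into AVL by a planning step.
        return interpret_avl(plan_visualization(body))
    if kind == "html":
        # Simplification: only the embedded AVL is handled here.
        return [interpret_avl(f) for f in extract_embedded_avl(body)]
    raise ValueError(f"cannot visualize content of kind {kind!r}")


interpreted = []


def interpret_avl(code):
    # Stand-in for the Object Manager: just record what it was given.
    interpreted.append(code)
    return len(interpreted)


page = "<H1>The Tecate car demo</H1>\n<AVL>\nclone car CarPart\n</AVL>"
dispatch("html", page, interpret_avl, plan_visualization=lambda body: "")
```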

Example Applications 

Applications that browse the contents of data spaces 
and then interactively visualize selected results have 
the same overall structure. One browser application 
component acts as a data space interface, and through 
this interface queries are posed, query results are
imported into the application, and data generated by
the application is stored back into a data space. Once
data has been imported into the application, a second
component must map the data into some appropriate
virtual world. Finally, a third component must manage
any interactions that may take place between an end
user and elements that populate the virtual worlds that 
are created. 

In creating an application using Tecate, the Database 
Interface and the WWW Interface represent resources 
that can be used to form the application's data space 
interface. The mapping of data into a representative
virtual world can utilize Tecate's Intelligent Visualization
System and the BigRiver visualization programming
system. Finally, the management of these worlds
can take place through AVL programs that exercise the 
features of Tecate's Abstract Visualization Machine. 
The following two examples that were implemented in 
AVL illustrate how Tecate can be used to create applica- 
tions that browse data spaces. 

Visualizing Data in a Database 

A simple example of an application that exploits 
Tecate's features is one that browses for earth science 
data in a database and then provides visualizations of 
that data. The initial user interface for this application is 
built using a collection of user interface widgets, where 
each widget is a Tecate dynamic object. Because the 
Tecate system does not yet have a comprehensive 3-D
widget set, some widgets still rely on two-dimensional
(2-D) constructs provided by the Tk widget set that
is implemented on top of the Tcl language.

Figure 4 depicts the flow of messages between some 
of the more important objects that are used within the 
application. One object is the Map Query Tool that 
is used to make certain graphical queries for earth 
science data sets whose geographical extents and time
stamps fall within user-specified constraints. The tool
is built around a world map on which regions of interest
can be specified (see Figure 5). When a user marks
a region of interest on the map and selects a temporal
range, a query message is sent to the Database
Interface. The result of the query is returned to the
Map Query Tool, which then forwards a description
of the result to the Intelligent Visualization System. 
To structure an appropriate visualization, an inferred 
select task directive accompanies the result. The ensu- 
ing script produced by the Intelligent Visualization 
System is executed by BigRiver, which produces a 
stream of AVL code that is sent to the Abstract 
Visualization Machine for interpretation.

Figure 5

The Map Query Tool Showing a Visualization of a Query Result



This AVL program creates a new virtual world that 
consists of a collection of 3-D objects. Each object acts 
as an icon that corresponds to one data set that was 
returned as the result of the initial query (see Figure 
5). The Intelligent Visualization System also builds
in two behaviors for each icon. Depending on how
a user selects an icon, either the metadata associated
with the data set represented by the icon is displayed in
a separate window or a query message is sent to the
Database Interface requesting the actual data. In
the latter case, the Map Query Tool again forwards
the query result to the Intelligent Visualization System, 
and another virtual world containing objects repre- 
senting data features is created and displayed with 
the aid of BigRiver and the Abstract Visualization 
Machine. In general, data exploration proceeds this 
way by creating and discarding virtual worlds based on 
interactions with objects that populate prior worlds. 

After selecting an icon to actually view the data associated
with it, an end user is asked by the Intelligent
Visualization System to input a task specification using
a Task Editor. Generally, data sets can be visualized in
many different ways. The Intelligent Visualization
System uses the task specification to select the one 
visualization that best satisfies the stated task. After 
a task specification is entered, a visualization of the 
selected data set appears on the screen. The BigRiver 
dataflow program that the Intelligent Visualization 
System creates to do that visualization can be edited by 
hand by knowledgeable end users to override the deci- 
sions made by the system. 

Figure 6 shows a Task Editor and a visualization 
crafted by the Intelligent Visualization System after an 
end user selected a data-set icon. The visualization rep- 
resents hydrological data that consists of a collection of 
tuples, each corresponding to a set of measurements 
made at discrete geographical locations. Based on 
the task specification that the end user entered, the 
Intelligent Visualization System chose to map the data 
into a coordinate system that has axes that represent 
latitude, longitude, and elevation. Each sphere repre- 
sents an individual measurement site, whose color is 
a function of the mean temperature. When an end user
selects a sphere, the actual data values associated with
the location represented by the sphere are displayed.
In addition, the Intelligent Visualization System automatically
places into the virtual world of the visualization
a color legend to help relate sphere colors to mean
temperature values.

Figure 6

Task Editor Showing a Visualization of Hydrological Data

Figure 7 depicts another virtual world showing a
visualization of data-set output from a regional climate
model program. The data set is a 3-D array indexed by
latitude, longitude, and elevation. Each array element
is a tuple that contains cloud density, water content,
and temperature values. In this instance, the end user
entered a task specification that stated that the spatial
variation in temperature was of primary importance.
The Intelligent Visualization System responded by
specifying a visualization that represented the temperature
data as an isosurface, i.e., a surface whose points
all have the same value for the temperature. Included
in the virtual world is a widget that can be used to
change the isosurface value and the field variable that
is being studied.

The isosurface widget that appears in the visualization
shown in Figure 7 is of special interest because of
the way that it is implemented. Embedded in the tool
is a slider that is used to change the isosurface value. As
with most sliders, the slider value indicator automatically
moves when a mouse button is held down while
pointing at one of the slider ends. To achieve this simple
animation, Tecate's clock object is used. When the
mouse button is first depressed while the cursor is over
a slider end, the slider indicator registers itself to be
informed of clock ticks. From then on, at every clock
tick, the indicator receives an update message from the
clock, at which time the indicator repositions itself and
increments or decrements the current slider value.
When the mouse button is released, the slider sends
a message to BigRiver indicating that a new isosurface
is to be calculated and displayed. In addition, the slider
indicator unregisters itself from the clock, signaling
that it no longer is to receive the update messages. In
general, applications can use this same clock mechanism
to perform more elaborate animations.

A 3-D World Wide Web Browser 

In the Tecate Web browser, exploration of the World 
Wide Web and its contents occurs by placing an end 
user onto an informational landscape. This landscape 
is a 3-D virtual world whose appearance reflects the 
content and the structure of a designated subset of the
entire Web. Upon application start-up, an end user
is presented with an initial informational landscape
that consists of a planar map of the earth embedded
in a 3-D space, as shown in Figure 8. In general, the
initial informational landscape can be any 3-D scene
and does not have to be geographically based. For
instance, an informational landscape might be a virtual
library where books on shelves serve as anchors for
hyperlinks to different Web sites.

Figure 7

Task Editor Showing a Visualization of Regional Climate Data, Including an Isosurface and a User Interface Widget

In the present browser application, selected Web
sites appear as 3-D icons on the world map. These
icons are positioned either in locations where Web
servers physically reside or in locations referenced
within Web documents (see Figure 8). A user places
information that describes these sites into a database
that serves as an elaboration of the hot list of current
hypertext-based browsers. When the browser application
is first started, it sends a query for the initial complement
of Web sites to the Database Interface. The
browser application then invokes a BigRiver script that 
visualizes the results by placing icons representing 
each site onto the world map. 

Suspended above the world map is a 3-D user inter- 
face widget that is used to query a database of Web 
sites that are of interest to an end user (see Figure 8). 
This database, where the initial set of Web sites is 
stored, includes information such as URLs, keywords, 
geographical locations, and Web site types. Currently,
individual users are responsible for maintaining their
own databases by adding or removing Web site entries
by hand. An automated means for building these databases
can be easily added to the browser application so
that Web information could be accumulated based on
where and when an end user travels on the Web.

Figure 8

Tecate Web Browser Informational Landscape Showing
WWW Sites Depicted as 3-D Icons on a Map of the World

Digital Technical Journal Vol. 7 No. 3 1995

During a browsing session, the Web Query Tool
allows arbitrary SQL queries to be posed to the
database by an end user. In addition, the Web Query
Tool has provisions to allow packaged queries to be
initiated by a simple click of a mouse button. In both
cases, queries are sent to the Database Interface for
forwarding to the appropriate database server. The
Database Interface packages up the query results as
on-the-wire fiber bundles, which are returned to the
Web Query Tool. The Web Query Tool then invokes
a BigRiver script, which converts the fiber bundle data
into AVL code. This code, when interpreted by the
Object Manager, creates a visualization of the Web
sites that satisfy the query. Generally, a visualization
such as this consists of placing on the world map a set
of 3-D icons whose appearances are a function of the
Web site type. However, query result visualizations
need not be limited to an organization based on geographical
position. For instance, a query for the contents
of an end user's own file directory results in a
new informational landscape that consists of an evenly
spaced grid of icons suspended within a room, as
shown in Figure 9.
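
An evenly spaced grid layout of the kind just described can be computed with a few lines of arithmetic. The function below is a hypothetical illustration: its name, its parameters, and the choice of a roughly square grid are assumptions, not details taken from the browser.

```python
import math


def grid_positions(count, spacing=1.0, height=0.0):
    """Return (x, y, z) positions for count icons on a roughly square grid."""
    if count <= 0:
        return []
    columns = math.ceil(math.sqrt(count))
    return [
        ((i % columns) * spacing, (i // columns) * spacing, height)
        for i in range(count)
    ]


positions = grid_positions(5, spacing=2.0)
```

With five icons the grid is three columns wide, so the fourth and fifth icons start a second row.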

Each icon that appears within an informational
landscape is cloned from an AVL Hyperlink abstract
object that stores its URL in a state variable. Each Web
site icon inherits from the Hyperlink prototype a
behavior that causes data pointed to by its URL state
variable to be fetched by means of the WWW Interface 
when the icon is selected. When the data is drawn 
across the Web, Tecate's WWW Interface attempts to 
structure a visualization of it. Figure 10 summarizes 
the message flow between the more important objects 
within the Web browser application. 

If an end user selects an icon and a Web server
returns a stream of HTML, the WWW Interface translates
the stream into AVL and displays the result on the
base of an inverted pyramid whose apex is centered on
the chosen icon (see Figure 11). The text and imagery
resulting from the HTML appear much as they
would when viewed using a hypertext-based
browser like Netscape. Hyperlinks are represented as
highlighted text, which the user can follow by selecting
the text. These hyperlinks are Tecate objects that
are cloned from the same Hyperlink prototype as the
Web site icons. If another HTML document is
retrieved by following a hyperlink, that document
is viewed on the base of another inverted pyramid
whose apex rests on the selected text and so on (see
Figure 11). Rather than having to page back and forth
between hypertext documents as with most hypertext-based
browsers, in Tecate, an end user needs only to
move about the virtual world to gain an appropriate
viewpoint from which to examine a desired document.
Overall, as shown in Figure 11, a browsing session
with Tecate's Web browser results in a forest of pyramidal
structures that represent a pictorial history of
an end user's travels on the Web.

Figure 9

Sample End-user Nongeographical Informational Landscape

Although Tecate's Web browser is capable of viewing
HTML documents, its main purpose is not to
emulate what can currently be done using hypertext-based
browsers, albeit using 3-D. Rather, the new
browser is intended primarily to visualize more complex
types of data. When data does not consist of
a stream of HTML code, the WWW Interface attempts
to visualize what was returned from the Web. These
visualizations can take place in virtual worlds separate
from the informational landscape from which the data
request was initiated, or they can be placed within
the original informational landscape. Figure 12 depicts
an example of a Web document that has embedded
within it a miniature virtual world containing a model
of a car. An end user can freely interact with this model
to initiate any behavior defined for objects populating
the subworld. For instance, selecting the car with the
mouse causes the car wheels to spin. Figure 13 shows
the AVL code embedded in the HTML page for the
Web document shown in Figure 12.

Conclusions 

Tecate provides the infrastructure on which applica- 
tions can be created for browsing and visualizing data 
from networked data sources. Architecturally, Tecate 
seeks to bring together into one package useful fea- 
tures found in visualization systems, network browsers, 
database front ends, and virtual reality systems. As a 
first prototype, Tecate was created using a breadth-first
development strategy. That is, developers deemed it
essential to first understand what components were 
needed to build a general data space exploration utility 
and then determine how those components interact.

Figure 12

Example of a Web Document with Embedded 3-D Virtual World

<HEAD>
<TITLE>The Tecate car demo</TITLE>
</HEAD>
<BODY>
<H1>The Tecate car demo</H1>

<AVL>

H GLobaL variables 

global T E C_W EB_PARENT TEC_WEB_WIN 
set path " / p r o j e c t s / s 2 k / s h a r ed a t a " 



# Define car part prototype 
clone CarPart Visual 
add CarPart { 

state {angle 10} 
appearance { 

repType surface 
interpType surface 

> 

behavior { 

around (args) { 

for {set i 0} {$i < 360} {incr i [getstate angle]} { 
send [getseLf] rotate "add 0 [getstate angle] 0" 

} 

} 

} 



H Define car body 
clone car_body CarPar 
add c a r_b o d y { 
appearance { 

replacematrix { 
shape {AliasObj 



t 

rotate {0.0 0.0 90.0}} 
"$path/car_body. tri"} 



# Define generic wheel 
clone wheel CarPart 



H Define car's four wheels 
clone back_right CarPart 



# Assemble car 
clone wheels CarPart 

add wheels {subobject {back_right back_left front_left f ro n t_r i g h t } } 
clone car CarPart 
add car { 

appearance {replacematrix {translate {28.0 -8.0 3.0} rotate {90.0 90.0 0.0}}} 
subobject (car_body wheels} 

} 

add $ T E C_W E B_P A R E N T {subobject {car}} 
tt Bind pick events to car 

send $TEC_WEB_WIN addEvent {wheel {Pick-Shift-Button-1 {rot_wheels {}}}} 
send $TEC_WEB_WIN addEvent {wheel { P i c k -Bu t t o n - 1 {around {}}}} 
send $TEC_WEB_WIN addEvent {car {Pick-Button-1 {around {}}}} 
</AVL> 



<PRE> 

Button-1 on car to rotate the car <BR> 

Button-1 on a wheel to rotate the wheels <BR> 

Shift Button-1 on a wheel to change the wheels <BR> 

</PRE> 



<HR> 
<P> 

</B0DY> 



Figure 13 

AVI. Code Km bedded in the HTML Page for the VVcb Document Kxample 
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This development strategy traded off the functionality 
of individual components for the completeness of 
a fully running visualization system, 

In terms of achieving its design goals, the Tecate
effort has been moderately successful. Tecate can now
provide interfaces to two kinds of data spaces: the
World Wide Web and databases managed by the POSTGRES
and Illustra database management systems. In addition,
interfaces to other data spaces can be implemented
easily by creating new resource objects using the tools
provided by Tecate. Much work still needs to be done,
however. For example, the attendant data translation
problem must be satisfactorily solved; data passing
through an interface that is stored in one format
should be automatically converted into Tecate's favored
format and vice versa.

When building visualizations of data, Tecate now
understands data that has a specific conceptual
structure, in particular, arbitrary sets of tuples and
multidimensional arrays where array elements may be
tuples. Although data types from many different
disciplines possess such a structure, some types remain
that do not, for instance, data that has a lattice-like
or polyhedral structure. Furthermore, Tecate can now
construct only crude visualizations of the data types
that it does understand. The primary reason for this
shortcoming is that the basic module set within the
BigRiver resource is incomplete, and the knowledge base
within the Intelligent Visualization System contains
limited knowledge of visualization techniques that can
be used to transform data into virtual worlds.

At present, Tecate does dynamically craft simple user
interfaces and interactive visualizations using its
Intelligent Visualization System. This expert system
takes into account how data is conceptually structured
and end-user tasks regarding what is to be understood
from the data. Still, the Intelligent Visualization
System does not yet consider data semantics, end-user
preferences, or display system characteristics when
building visualizations. Nonetheless, Tecate does
provide the capabilities to create highly interactive
applications. Sophisticated event handling constructs
are built into AVL, and the Intelligent Visualization
System uses those features to automatically place user
interface widgets into the virtual worlds it specifies.

Regarding future work, we hope that succeeding
generations of the Tecate system will include many new
features and enhancements. The management of objects
needs to be reworked so that thousands of objects can
be efficiently handled simultaneously. Although Tecate
now builds virtual worlds, virtual reality gadgetry has
yet to be integrated into the system. The Abstract
Visualization Language needs new features, and it needs
to be streamlined. Tecate can also benefit greatly from
a more complete toolkit of 3-D widgets that can be used
to interact with objects within virtual worlds.
Finally, the Dore graphics system that Tecate uses
should be replaced with a more mainstream system like
OpenGL, which will allow Tecate to run on a wide
variety of hardware platforms.

Tecate is an exciting system to use and an excellent
foundation from which to pursue further research and
development in the exploration of general data spaces.
Tecate advances the state of the art by demonstrating a
comprehensive means to graphically browse for data and
then interactively visualize data sets that are
selected. Tecate accomplishes these tasks by using an
expert system that automatically builds virtual worlds
and by exploiting the flexibility of an interpretive,
object-oriented language that describes those worlds.

Software in Sequoia 2000

The Sequoia 2000 project requires a high-speed
network and I/O software for the support of
global change research. In addition, Sequoia
distributed applications require the efficient
movement of very large objects, from tens to
hundreds of megabytes in size. The network
architecture incorporates new designs and
implementations of operating system I/O software.
New methods provide significant performance
improvements for transfers among devices and
processes and between the two. These techniques
reduce or eliminate costly memory accesses, avoid
unnecessary processing, and bypass system overheads
to improve throughput and reduce latency.



In the Sequoia 2000 project, we addressed the problem
of designing a distributed computer system that can
efficiently retrieve, store, and transfer the very
large data objects contained in earth science
applications. By very large, we mean data objects in
excess of tens or even hundreds of megabytes (MB).
Earth science research has massive computational
requirements, in large part due to the large data
objects often found in its applications. There are many
examples: an advanced very high-resolution radiometer
(AVHRR) image cube requires 300 MB, an advanced visible
and infrared imaging spectrometer (AVIRIS) image
requires 140 MB, and the common land satellite
(LANDSAT) image requires 278 MB. Any throughput
bottleneck in a distributed computer system becomes
greatly magnified when dealing with such large objects.
In addition, Sequoia 2000 was an experiment in
distributed collaboration; thus, collaboration tools
such as videoconferencing were also important
applications to support.

Our efforts in the project focused on operating system
I/O and the network. We designed the Sequoia 2000 wide
area network (WAN) test bed, and we explored new
designs in operating system I/O and network software.
The contributions of this paper are twofold: (1) it
surveys the main results of this work and puts them in
perspective by relating them to the general data
transfer problem, and (2) it presents a new design for
container shipping. (For a complete discussion of
container shipping, see Reference 1.) Since container
shipping is a new design, this paper devotes more space
to it in relation to the other surveyed work (whose
detailed descriptions may be found in References 2 to
9). In addition to this work, we conducted other
network studies as part of the Sequoia 2000 project.
These include research on protocols to provide
performance guarantees and multicasting.

To support a high-performance distributed computing
environment in which applications can effectively
manipulate large data objects, we were concerned with
achieving high throughput during the transfer of these
objects. The processes or devices representing the data
sources and sinks may all reside on the same
workstation (single node case), or they may be
distributed over many workstations connected by the
network (multiple node case). In either case, we wanted
applications, be they earth science distributed
computations or collaboration tools involving
multipoint video, to make full use of the raw bandwidth
provided by the underlying communication system.

In the multiple node case, the raw bandwidth is from
45 to 100 megabits per second (Mb/s), because the
Sequoia 2000 network used T3 links for long-distance
communication and a fiber distributed data interface
(FDDI) for local area communication. In the single node
case, the raw bandwidth is approximately 100 megabytes
per second, since the workstation of choice was one of
the DECstation 5000 series or the Alpha-powered DEC
3000 series, both of which use the TURBOchannel as the
system bus.
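
To put these raw bandwidths in perspective, the following back-of-the-envelope sketch computes the best-case time to move the three example objects cited earlier over each medium. It assumes 1 MB = 10^6 bytes and no protocol or software overhead at all; the names used are illustrative only.

```python
# Best-case transfer times for the example data objects at the raw
# bandwidths quoted in the text. Assumptions: 1 MB = 10**6 bytes and
# no protocol or software overhead of any kind.

OBJECT_SIZES_MB = {
    "AVHRR image cube": 300,
    "AVIRIS image": 140,
    "LANDSAT image": 278,
}

# Raw bandwidths expressed in bits per second.
RAW_BANDWIDTH_BPS = {
    "T3 link (45 Mb/s)": 45e6,
    "FDDI ring (100 Mb/s)": 100e6,
    "TURBOchannel (about 100 MB/s)": 100e6 * 8,
}


def transfer_seconds(size_mb, bandwidth_bps):
    """Return the ideal transfer time in seconds for size_mb megabytes."""
    return size_mb * 1e6 * 8 / bandwidth_bps


if __name__ == "__main__":
    for obj, size in OBJECT_SIZES_MB.items():
        for medium, bps in RAW_BANDWIDTH_BPS.items():
            secs = transfer_seconds(size, bps)
            print(f"{obj:18s} over {medium:30s}: {secs:6.1f} s")
```

Under these assumptions, a 300 MB image cube needs at least about 53 seconds on a T3 link and 24 seconds on an FDDI ring, so any software inefficiency adds to a delay that is already noticeable.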

Our work focused only on software improvements, in
particular how to achieve maximum system software
performance given the hardware we selected. In fact, we
found that the throughput bottlenecks in the Sequoia
distributed computing environment were indeed in the
workstation's operating system software, and not in the
underlying communication system hardware (e.g., network
links or the system bus). This problem is not limited
to the Sequoia environment: given modern high-speed
workstations (100+ millions of instructions per second
[mips]) and fast networks (100+ Mb/s), performance
bottlenecks are often caused by software, especially
operating system software. System software throughput
has not kept up with the throughputs of I/O devices,
especially network adapters, which have improved
tremendously in recent years. These technology
improvements are being driven by a new generation of
applications, such as interactive multimedia involving
digital video and high-resolution graphics, that have
high I/O throughput requirements. Supporting these
applications and controlling these devices have taxed
operating system technology, much of which was designed
during times when intensive I/O was not an issue.

In the next section of this paper, we describe the
Sequoia 2000 network, which served as an experimental
test bed for our work. Following that, we analyze the
data transfer problem, which serves as the context for
the three subsequent sections. There we describe our
solutions to the data transfer problem. Finally, we
present our conclusions.

The Sequoia 2000 Network Test Bed 

The Sequoia 2000 network is a private WAN that we
designed to span five campuses at the University of
California: Berkeley, Davis, Los Angeles, San Diego,
and Santa Barbara. The topology is shown in Figure 1.
The backbone link speeds are 45 Mb/s (T3) with the
exception of the Berkeley-Davis link, which is 1.5 Mb/s
(T1). At each campus, one or more FDDI local area
networks (LANs) that operate at 100 Mb/s are used for
local distribution. At some campuses, the configuration
is a hierarchical set of rings. For example, at UC San
Diego, one FDDI ring covered the campus and joined
three separate rings: one at the Computer Systems Lab
(our laboratory) in the Department of Computer Science
and Engineering, one at the Scripps Institution of
Oceanography, and one at the San Diego Supercomputer
Center.

Figure 1

Sequoia 2000 Research Network

We used high-performance general-purpose computers as
routers, originally DECstation 5000 series and later
DEC 3000 series (Alpha powered) workstations. Using
workstations as routers running the ULTRIX or the DEC
OSF/1 (now Digital UNIX) operating system provided us
with a modifiable software platform for
experimentation. The T3 (and T1) interface boards were
specially built by David Boggs at Digital. We used
off-the-shelf Digital products for FDDI boards: the
model DEFTA, which supports both send and receive
direct memory access (DMA), and the model DEFZA, which
supports only receive DMA.

The Data Transfer Problem 

Since a data source or sink may be either a process or
device, and the operating system generally performs the
function of transferring data between processes and
devices, understanding the bottlenecks in these
operating system data paths is key to improving
performance. These data paths generally involve
traversing numerous layers of operating system
software. In the case of network transfers, the data
paths are extended by layers of network protocol
software.

To understand the performance problem we were trying to
solve, consider a common client-server interaction in
which a client has requested data from a server. The
data resides on some source device, e.g., a disk, and
must be read by the server so that it may send the data
to the client over a network. At the client, the data
is written to some sink device, e.g., a frame buffer
for display.

Figure 2 shows a typical end-to-end data path where the
source and sink end-point workstations are running
protected operating system kernels such as UNIX. The
source device generates data into the memory of its
connected workstation. This memory is generally only
addressable by the kernel; to allow the server process
to access the data, it is physically copied into memory
addressable via the server process's address space,
i.e., user space. Physically copying data from one
memory location to another (or more generally, touching
the data for any reason) is a major bottleneck in
modern workstations.

In travelling through the kernel, the data generally
travels over a device layer and an abstraction layer.
The device layer is part of the kernel's I/O subsystem
and manages the I/O devices by buffering data between
the device and the kernel. The abstraction layer
comprises other kernel subsystems that support
abstractions of devices, providing more convenient
services for user-level processes. Examples of kernel
abstraction layer software include file systems and
communication protocol stacks: a file system converts
disk blocks into files, and a communication protocol
stack converts network packets into datagrams or stream
segments. Sometimes, a kernel implementation may cause
physical copying of data between the device layer and
the abstraction layer; in fact, copying may even occur
within these layers.



From kernel space, the data may travel across several
more layers in user space, such as the standard I/O
layer and the application layer. The standard I/O layer
buffers I/O data in large chunks to minimize the number
of I/O system calls. The application layer generally
has its own buffers where I/O data is copied.

From the server process in user space, the data is then
given to the network adapter; this may cause transfers
across user process layers and then across the kernel
layers. The data is then transferred over the network,
which generally consists of a set of links connected by
routers. If the routers have kernels whose software
structure is like that described above, a similar (but
typically simpler) intramachine data transfer path will
apply.

Finally, the data arrives at the client's workstation. 
There, the data travels in a similar way as was described 
for the server's workstation: from the network adapter, 
across the kernel, through the client process's address 
space, and across the kernel again, finally reaching the 
sink device. 

From this analysis, one can surmise why throughput
bottlenecks often occur at the end points of the
end-to-end data transfer path, assuming sufficiently
fast hardware devices and communication links. At the
end points, there may be significant data copying as
the data traverses the various software layers, and
there is protection-domain crossing (kernel to user to
kernel), among other functions. The overheads caused by
these functions, directly and indirectly, can be
significant.
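
The analysis above can be made concrete with a small model of the server-side path. This is only an illustrative sketch: the layer list follows the description in the text, but the assumption that every layer boundary costs one physical copy is a deliberate worst case, and all names are hypothetical.

```python
# Toy model of the server-side data path described above. Each stage is
# a (layer name, protection domain) pair. The copy count is a worst case
# that assumes one physical copy at every layer boundary; real kernels
# copy at only some of these boundaries.

SERVER_PATH = [
    ("source device layer", "kernel"),
    ("abstraction layer (file system)", "kernel"),
    ("standard I/O layer", "user"),
    ("application layer", "user"),
    ("standard I/O layer", "user"),
    ("abstraction layer (protocol stack)", "kernel"),
    ("network device layer", "kernel"),
]


def domain_crossings(path):
    """Count transitions between protection domains along the path."""
    return sum(1 for (_, a), (_, b) in zip(path, path[1:]) if a != b)


def worst_case_copies(path):
    """Upper bound on copies: one per boundary between adjacent layers."""
    return len(path) - 1


if __name__ == "__main__":
    print("protection-domain crossings:", domain_crossings(SERVER_PATH))
    print("worst-case physical copies: ", worst_case_copies(SERVER_PATH))
```

In this model the server path crosses protection domains twice (kernel to user, then user to kernel) and could copy the data up to six times, which is the overhead the rest of the paper sets out to remove.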

Consequently, we focused on improving operating system
I/O and network software, including optimizations for
the four possible process/device data transfer
scenarios: process to process, process to device,
device to process, and device to device, with special
care in addressing cases where either source or sink
device is a network adapter. In this paper, we use the
term data transfer problem to refer to the problem of
reducing these overheads to achieve high throughput
between a source device and a sink device, either of
which can be a network adapter within a single
workstation.

Although the data transfer problem may also exist in
intermediate routers, it does so to a much lesser
degree than with end-user workstations (assuming modern
router software and hardware technology). This is
because of a router's simplified execution environment
and its reduced needs for transfers across multiple
protected domains. However, there is nothing that
precludes the application of the techniques discussed
in this paper to router software. In fact, since we
used general-purpose workstations for routers to
support a flexible, modifiable test bed for
experimentation with new protocols, our work was also
applied to router software.

In the next three sections, we describe various
approaches to solving the data transfer problem. Since
data copying/touching is a major software limitation in
achieving high throughput, avoiding data
copying/touching is a constant theme. Much of our work
involves finding ways to avoid or limit touching the
data without sacrificing the flexibility or protection
commonly provided by most modern operating systems.

We describe two solutions to the data transfer problem
that avoid all physical copying and are based on the
principle of providing separate mechanisms for I/O
control and data transfer. The reader will see that
while these two solutions are based on different
approaches (indeed, they can even be viewed as
competing), they fill different niches based on
differing assumptions of how I/O is structured. In
other words, each is appropriate and optimal for
different situations. In addition to the data transfer
problem, we address a special problem: the bottleneck
created by the checksum computation for I/O on a
network using the transmission control
protocol/internet protocol (TCP/IP).
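
The checksum is a bottleneck because it must read every byte of the payload, which runs against the goal of not touching the data. The following is a minimal sketch of the standard Internet checksum (as described in RFC 1071), shown only to illustrate why the computation touches all of the data; it is not the optimized implementation developed in this work.

```python
def internet_checksum(data: bytes) -> int:
    """16-bit one's-complement checksum used by IP, TCP, and UDP."""
    if len(data) % 2:
        data += b"\x00"                 # pad odd-length input with a zero byte
    total = 0
    for i in range(0, len(data), 2):
        # Every byte of the payload is read exactly once here.
        total += (data[i] << 8) | data[i + 1]
    while total >> 16:                  # fold carries back into the low 16 bits
        total = (total & 0xFFFF) + (total >> 16)
    return ~total & 0xFFFF


if __name__ == "__main__":
    sample = bytes.fromhex("0001f203f4f5f6f7")
    print(hex(internet_checksum(sample)))
```

A receiver can verify a message by running the same function over the data followed by the transmitted checksum; a result of zero indicates that no error was detected.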

Container Shipping 

Container shipping is a kernel service that provides
I/O operations for user processes. High performance is
obtained by eliminating the in-memory data copies
traditionally associated with I/O. Additional gains are
achieved by permitting the selective accessing
(mapping) of data. Finally, the design we present makes
possible specific optimizations that further improve
performance.

The goals of the container shipping model of data
transfer for I/O are to provide high performance
without sacrificing protection and to fully support the
principle of general-purpose computing. Full access to
I/O data by user-level processes has long been a
standard feature of operating systems. This ability has
traditionally been provided by copying data to and from
process memory at each instance when data is
transferred. The divergence of CPU and dynamic random
access memory (DRAM) speeds makes this in-memory
copying more inefficient and costly every year. This
problem is often attacked with application-specific
silicon or kernel modifications. A less costly and
longer-lasting solution is to redesign the I/O
subsystem to provide copy-free I/O. Container shipping
provides this ability, as well as additional
performance gains, in a uniform, general, and practical
way.

Containers 

A container is one or more pages of memory. In these
pages, it may contain a single block of data, whose
location is identified by an offset and a length. When
a container is mapped into an address space, the pages
form a contiguous region of memory, where the data can
be manipulated. A container can be owned by one and
only one domain, e.g., some user process or the kernel
itself, at any single point in time. The owning domain
may map the container for access. When access is not
required, mapping can be avoided, which saves time.

User-level processes use container shipping system
calls to perform the following functions:

■ Allocation: cs_alloc and cs_free allocate and
deallocate containers and their resources (e.g.,
physical pages).

■ Transfer: cs_read and cs_write perform I/O using
containers.

■ Mapping: cs_map and cs_unmap allow a process to
access the data in a container.

The cs_read and cs_write calls take as parameters an
I/O path identifier (such as a UNIX file descriptor),
a data size, and parameters describing a list of
containers, or a return area for such a list. Several
options are also available, such as one for cs_read
that immediately maps all the resulting containers.
Data is never copied within memory to satisfy cs_read
and cs_write, so all I/O performed this way is
copy-free.

Because the mapping of containers is always optional,
a process can move data from one device to another
without mapping it at all. When containers of data flow
through a pipeline of several processes, substantial
additional savings can be obtained if several of the
processes do not map the containers, or if they map
only some of the containers.
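
The calls described above can be illustrated with a toy, in-process model. The cs_read, cs_map, and cs_write names come from the text, but every signature, class, and field below is a hypothetical stand-in chosen for illustration and is not the real kernel interface. The sketch shows a process relaying data between two I/O paths without mapping or copying it.

```python
# Toy, in-process model of container shipping. All signatures and
# classes here are illustrative assumptions, not the actual system calls.

class Container:
    def __init__(self, pages, owner):
        self.pages = pages      # stands in for the physical pages
        self.owner = owner      # exactly one owning domain at a time
        self.mapped = False


class ToyKernel:
    def __init__(self):
        self.queues = {}        # I/O path name -> list of page buffers
        self.maps = 0           # number of map operations performed

    def cs_read(self, proc, path, map_now=False):
        pages = self.queues[path].pop(0)
        container = Container(pages, owner=proc)    # reader gains the pages
        if map_now:
            self.cs_map(proc, container)
        return container

    def cs_map(self, proc, container):
        if container.owner != proc:
            raise PermissionError("only the owning domain may map")
        container.mapped = True
        self.maps += 1
        return container.pages

    def cs_write(self, proc, path, container):
        if container.owner != proc:
            raise PermissionError("only the owning domain may write")
        container.mapped = False                    # unmapping is automatic
        container.owner = "kernel"                  # writer loses the pages
        self.queues.setdefault(path, []).append(container.pages)


def relay(kernel, proc, src, dst):
    """Move one container from src to dst without ever mapping it."""
    kernel.cs_write(proc, dst, kernel.cs_read(proc, src))


if __name__ == "__main__":
    k = ToyKernel()
    payload = bytearray(b"x" * 8192)
    k.queues["disk"] = [payload]
    relay(k, "proc-A", "disk", "net")
    # The buffer queued on "net" is the very same object; no map was needed.
    print(k.queues["net"][0] is payload, k.maps)
```

In this model the relay costs two calls and zero map operations, and the buffer that arrives on the output path is the same object that left the input path.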

Although container shipping has six different system
calls versus the two of conventional I/O, read and
write, the actual number of calls a process issues with
container I/O may be no greater than with conventional
I/O. When data is not mapped, only cs_read and cs_write
calls are required. Even if data is mapped, it may be
possible to perform the mapping through flags to
cs_read, without calling cs_map. Unmapping is automatic
in cs_write, so if cs_unmap is not used, two system
calls are still sufficient.

As shown in Figure 3, a process reads data in a
container from one device and writes it to another
device. Three pages of memory form one container that
stores two and one-half pages of data. On input
(cs_read), the source device deposits data into
physical memory pages forming the container. The
process that owns the container may then map (cs_map)
it so that the data can be manipulated in its address
space. The data is then output (cs_write) to the sink
device. Output can occur without having mapped the
container. Mapping can also occur automatically on
cs_read.

Eliminating In-Memory Copying 

Unconditionally avoiding the copying of data within
memory during I/O leads to the first of several
performance gains from container shipping. Other
solutions exist that avoid copies only in limited
cases. To be uniform and general, copy-free I/O must be
possible without restrictions due to the devices used,
the order of operations, or the availability of special
device hardware.

In many I/O operations, the data requested by a
user-level process is already in system memory when the
request is made. This situation can arise when data is
moving between two processes via the I/O system, such
as is done with pipes. Many optimized file systems
perform read-ahead and in-memory caching to improve
performance, so file I/O requests may also be satisfied
with data that is already in memory. Finally,
conventional network adapters transfer entire packets
into memory before they are examined by protocol layers
in the kernel. Only after protocol processing can this
data be delivered to the correct user-level process.
When requested data is already in memory, the only
possible copy-free transfer mechanism that allows full
read/write access in the address space of a process is
virtual memory remapping. Techniques that rely on
device-specific characteristics such as programmable
DMA or outboard protocol processors cannot provide
uniform, device-independent copy-free I/O, because
these mechanisms cannot transfer data that is already
in memory.

Using virtual memory remapping, container shipping can
perform copy-free I/O regardless of when or where data
arrives in memory, and with or without any special
device hardware that might be available. Virtual memory
hardware is employed to control the ownership of, and
access to, memory that contains I/O data. Ownership and
access rights are transferred between domains when
container I/O is performed, while data sits motionless
in memory. This technique requires no special
assistance from devices and applies to interprocess
communication as well as all physical I/O. Because
user-level processes retain complete access to I/O data
with no in-memory copying, user-level programming
remains a practical solution for high-performance
systems.

The Gain/Lose Model 

In container I/O, reading and writing are coupled with
the gain and loss of memory. We chose the gain/lose
model because it is simple and provides higher
performance without sacrificing protection. Shared
memory is a more complicated alternative to the
gain/lose model that also avoids data copying. The use
of shared memory to allow a set of processes to
efficiently communicate, however, reduces the
protection between domains. Shared-memory I/O schemes
also tend to be complicated because of the close
coordination required between a user process and the
kernel when they both manipulate a shared data pool.
Since data is never shared under the gain/lose model,
protection domains need not be compromised, and less
user/kernel cooperation and trust is required.

The gain/lose model has three major implications for
programmers. First, a process must dispose of I/O data
that it gains, or memory consumption may grow rapidly.
One way to dispose of data is to perform a cs_write
operation on it, so a process performing matched reads
and writes on a stream of data will not accumulate any
extra memory. Second, to avoid seriously complicating
conventional memory models, not all memory is eligible
for use in write operations. For example, writing data
from the stack would leave an inconvenient hole in that
part of the virtual memory, so this is not allowed.
Finally, because data that is written is lost, a
writing process must copy any data that will still be
needed after the write. Fortunately, applications that
move great volumes of data often have no further need
for it after a write is completed.
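
The third implication is the one most likely to surprise a programmer used to conventional write semantics. The sketch below models it from the user's side; the Container class and this cs_write signature are illustrative assumptions, not the real interface.

```python
# Sketch of the gain/lose rule as seen by a user program. The class and
# the cs_write signature are hypothetical and exist only to show that
# written data is no longer accessible to the writer.

class Container:
    def __init__(self, data):
        self._data = bytearray(data)
        self._lost = False

    def view(self):
        """Return the data, or fail if the container has been written away."""
        if self._lost:
            raise RuntimeError("container was lost by cs_write")
        return self._data


def cs_write(sink, container):
    """Hand the container's memory to the sink; the caller loses it."""
    sink.append(container.view())    # ownership moves, the bytes do not
    container._lost = True


if __name__ == "__main__":
    sink = []
    c = Container(b"HDR:payload")
    header = bytes(c.view()[:4])     # copy only what is needed afterwards
    cs_write(sink, c)
    try:
        c.view()
    except RuntimeError as err:
        print("after write:", err)
    print("saved header:", header)
```

Copying only the few bytes that are still needed, rather than the whole object, keeps the cost of this rule small for applications that stream large data.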

Implications of Virtual Memory Remapping 

In addition to the use of the gain/lose model, the
decision to use virtual memory remapping has
substantial implications for the design and use of an
I/O system. Several changes are unavoidably visible to
programmers. For example, data can no longer be placed
exactly at any requested location in an address space.
Virtual memory remapping can change the virtual page in
which a physical page of memory appears, but it cannot
realign data within a page. Furthermore, mapping can
rearrange memory only at page boundaries. The exact
location where incoming I/O data is placed is
determined by the kernel. After a read operation is
complete, a process can discover the address of the
data and access it at that address.

Some kinds of I/O place data in memory in a form that
differs from the way it is presented to user-level
processes. For example, network packets may arrive with
media-level headers that are not seen by higher levels.
These packets may also arrive out of order, or in
fragments that collectively form a single message.
Without help from an outboard protocol processor or the
use of in-memory copying, these packets cannot be
linearized. With container shipping, a process may be
required to accept a message that consists of multiple
fragments in memory. The semantics of the communication
do not change, but the data representation differs.
This issue is less troublesome for writes, because
kernels typically use internal structures to reorganize
network data without copying it. The mbufs found in
UNIX are an example of such a kernel structure.

Virtual memory remapping is not a simple technique, and
it must be used with care to achieve high performance.
Although remapping a page is almost always faster than
copying it, remapping also consumes time. This time
comes from kernel virtual memory bookkeeping and from
side effects (such as translation lookaside buffer
[TLB] flushes) of address space changes. For these
reasons, container shipping makes all mapping optional.
Some operating systems such as Mach perform lazy
mapping, using the page fault mechanism to map pages
when they are first accessed. This technique avoids
unnecessary map operations but incurs the extra penalty
of having to map on demand while a program waits for
access to data. Taking one page fault for every page in
a large region, as is common in modern systems, is
particularly expensive. Furthermore, lazy mapping still
requires the setting of page table entries (and
possibly other data structures) to prepare for the
possibility of page faults, which can be costly for
very large data objects. This cost is avoided in
container shipping.
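
To give a sense of scale for the per-page cost of lazy mapping, the sketch below counts the pages, and hence the worst-case number of page faults at one fault per page, in the example objects cited in the introduction. The 8 KB page size (typical of Alpha systems) and the convention 1 MB = 2^20 bytes are assumptions made for this illustration.

```python
PAGE_SIZE = 8 * 1024    # bytes; assumed 8 KB pages
MB = 2**20              # assumption: 1 MB = 2**20 bytes


def pages_spanned(nbytes, page_size=PAGE_SIZE):
    """Number of pages needed to hold nbytes (ceiling division)."""
    return -(-nbytes // page_size)


if __name__ == "__main__":
    for name, size_mb in [("AVHRR", 300), ("AVIRIS", 140), ("LANDSAT", 278)]:
        n = pages_spanned(size_mb * MB)
        print(f"{name}: {n} pages, so up to {n} page faults if mapped lazily")
```

With these assumptions a single 300 MB object spans 38,400 pages, so touching all of it under lazy mapping could trigger tens of thousands of page faults.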



Optimizations 

The container shipping design makes possible
optimizations beyond copy and map elimination. Some
make use of the fact that I/O often flows through
pathways that are predictable. Other optimizations are
possible on a per-container basis.

High-speed I/O is often generated by long-running
processes, such as multimedia applications, real-time
data processing, or processes that run for a long time
merely by virtue of processing a very large data object
(common in Sequoia applications). This I/O typically
flows through pathways in the system that are
essentially static. Data enters through one device,
moves through a fixed set of domains, and leaves
through another device. Kernel awareness of this
locality can be used to optimize some container
operations.

An I/O path through which same-sized containers move
repeatedly offers the opportunity to recycle containers
and their associated data structures. Per-transfer cost
can be reduced by reusing the same set of pages and
reusing page tables and address space. To perform
recycling, the kernel can keep track of which
containers were given to which processes, or the kernel
can match up recycled containers by size or by device
type.
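
Matching recycled containers by size can be sketched as a set of free lists keyed by page count. The ContainerPool class below is a hypothetical helper written for illustration; it models only the reuse of buffers, not page tables or address space.

```python
from collections import defaultdict


class ContainerPool:
    """Free lists of containers keyed by size in pages (illustrative only)."""

    def __init__(self, page_size=8192):
        self.page_size = page_size
        self.free = defaultdict(list)
        self.fresh_allocations = 0

    def acquire(self, npages):
        """Return a recycled container of npages pages, or allocate one."""
        if self.free[npages]:
            return self.free[npages].pop()
        self.fresh_allocations += 1
        return bytearray(npages * self.page_size)

    def release(self, container):
        """Put a container back on the free list for its size."""
        self.free[len(container) // self.page_size].append(container)


if __name__ == "__main__":
    pool = ContainerPool()
    for _ in range(100):             # a static path moving same-sized data
        c = pool.acquire(3)
        pool.release(c)
    print("fresh allocations:", pool.fresh_allocations)
```

On a static path that moves same-sized containers, the pool allocates once and then reuses the same buffer for every later transfer.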

In a system with a large secondary cache, promptly
recycling a just-written container may allow its reuse
while its data is still in the cache. In the best case,
all data may be automatically cached because of this
recycling. For example, DMA operations in DEC
3000-series systems update the secondary cache. Because
this cache is much faster than main memory, the data
can now be accessed more quickly.

Even without identifying an I/O pathway, careful
tracking of the contents of container memory pages
can allow savings in security-driven zero fills. A
just-freed page consists entirely of sensitive data; the entire
page must be cleaned before it can be given to any
other user. But if this page is used as the target of a
data-generating operation such as a DMA, only the
part not overwritten needs to be zeroed. Furthermore,
this zeroing can be postponed until the data is mapped;
thus it may be avoided completely. If filling memory
with zeroes causes it to be loaded in the cache, zeroing
immediately before the map offers a cache benefit,
because the data may be used shortly after it is mapped.
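
A minimal sketch of the deferred, partial zero fill follows. It
assumes, for illustration only, a 4096-byte page, a DMA that writes
a prefix of the page, and hypothetical names (page_state,
page_dma_complete, page_prepare_for_map); none of these are taken
from the implementation.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define PAGE_SIZE 4096   /* illustrative; not taken from the article */

/* Hypothetical per-page bookkeeping. */
struct page_state {
    unsigned char data[PAGE_SIZE];
    size_t valid_bytes;   /* leading bytes overwritten by the last DMA */
    int needs_zero;       /* stale bytes from a previous owner remain */
};

/* Record that a DMA wrote `len` bytes at the start of the page.
 * No zeroing happens here; it is deferred until the page is mapped. */
void page_dma_complete(struct page_state *p, size_t len)
{
    assert(p != NULL && len <= PAGE_SIZE);
    p->valid_bytes = len;
    p->needs_zero = (len < PAGE_SIZE);
}

/* Called just before the page is mapped into a user address space.
 * Zeroes only the tail the DMA did not overwrite and returns the
 * number of bytes cleared. */
size_t page_prepare_for_map(struct page_state *p)
{
    assert(p != NULL);
    if (!p->needs_zero)
        return 0;
    size_t stale = PAGE_SIZE - p->valid_bytes;
    memset(p->data + p->valid_bytes, 0, stale);
    p->needs_zero = 0;
    return stale;
}
```

If the page is never mapped, page_prepare_for_map is never called
and no zeroing cost is paid; if the DMA fills the whole page, the
call clears nothing.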

Container Shipping Implementation and Performance 

Container shipping has been implemented in DEC
OSF/1 version 2.0 (now Digital UNIX) on
Alpha-powered DEC 3000-series workstations. All six system
calls are supported, and container I/O can be
measured in a variety of situations. Conventional UNIX
I/O remains, so a system can boot and run normally,
using container I/O only for specific experiments.

In our early paper, we showed significant through-
put improvements for container-based interprocess
communication (IPC) within the ULTRIX version
4.2a operating system on a DECstation 5000/200
system. With the new DEC OSF/1 implementation
on Alpha workstations, we compared the I/O
performance of conventional UNIX I/O to that of container
shipping for a variety of I/O devices as well as IPC.
These experiments are described in detail elsewhere.
Large improvements in throughput were observed,
from 40 percent for FDDI network I/O (despite large
non-data-touching protocol and device-driver
overheads) to 700 percent for socket-based IPC.

We devised an experiment that exercises both the
IPC and I/O capabilities of container shipping.
Images (640 x 480 pixels, 1 byte per pixel) are sent by
one process and received by a second process using
socket IPC. The receiver process then writes each image
to a frame buffer to display it on the screen. This
is a common application in the Sequoia project:
viewing an animation composed of images displayed at
a rate of up to 30 frames per second (fps). In fact,
scientists often want to view as many simultaneously
displayed animations as possible.

We carried out this experiment first using
conventional UNIX I/O (i.e., read and write) and then using
container shipping (i.e., cs_read and cs_write). Figure 4
shows the throughput obtained for a sender process
transferring data to a receiver process, which then
outputs the data to a frame buffer. The improvement of
container shipping over UNIX I/O is almost 400
percent. Assuming the maximum 30 fps rate,
conventional I/O supports the full display of one animation
and container I/O supports six. In general, the faster
an I/O device is relative to memory, the greater the
throughput advantage of container shipping over
UNIX I/O will be.
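
The arithmetic behind the "number of animations" figure can be
made explicit. The sketch below uses only the image format given
above (640 x 480 pixels, 1 byte per pixel); the function names are
illustrative, and the throughput passed in would have to come from
a measurement such as Figure 4, whose values are not reproduced
here.

```c
#include <assert.h>

enum {
    FRAME_WIDTH = 640,
    FRAME_HEIGHT = 480,
    BYTES_PER_PIXEL = 1
};

/* Bytes in one image: 640 * 480 * 1 = 307,200. */
unsigned long bytes_per_frame(void)
{
    return (unsigned long)FRAME_WIDTH * FRAME_HEIGHT * BYTES_PER_PIXEL;
}

/* Bytes one animation consumes per second at the given frame rate. */
unsigned long animation_bytes_per_second(unsigned fps)
{
    return bytes_per_frame() * fps;
}

/* Number of animations that can be shown at the full frame rate for a
 * sustained end-to-end throughput in bytes per second. */
unsigned long full_rate_animations(unsigned long throughput, unsigned fps)
{
    assert(fps > 0);
    return throughput / animation_bytes_per_second(fps);
}
```

At 30 fps each animation needs 9,216,000 bytes per second, so the
count of fully displayed animations is the measured throughput
divided by that figure, rounded down.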

Related Work 

The use of virtual transfer techniques to avoid the
performance penalty of physical copying goes back
to TENEX. Mach (like TENEX) uses virtual
copying, i.e., transferring a data object by mapping it in
the new address space, and then physically copying if
the data is modified (copy-on-write). This differs
from container shipping, which uses virtual moving;
i.e., the data object leaves the source domain and
appears in the destination domain, where it can be
read and written without causing fault handling,
which is expensive. If the original domain wants to
keep a copy, it may do so explicitly. Thus, container
shipping places a greater burden on the programmer
in return for improved performance.

The two systems that are most similar to container
shipping are DASH and Fbufs. Containers are
similar to the IPC pages used in DASH and the fast buffers
used by Fbufs. DASH provides high-performance
interprocess communication: it achieves fast, local IPC
by means of page remapping, which allows processes
to own regions of a restricted area of a shared address
space. The Fbufs system uses a similar technique,
enhanced by caching the previous owners of a buffer,
allowing reuse among trusted processes and
eliminating memory management unit (MMU) updates
associated with changing buffer ownership. The
differences between these two systems and container
shipping are examined in detail elsewhere.

Peer-to-Peer I/O

In addition to container shipping, we have
investigated an alternative I/O system software model called
peer-to-peer I/O (PPIO). As a direct result of the
structure of this model, its implementation avoids
the well-known overheads associated with data
copying. Unlike other solutions, PPIO also reduces the
number of context-switch operations required to
perform I/O operations. In contrast to container
shipping, PPIO is based on a streaming approach, where
data is permitted to flow between a producer and
consumer (these may be devices, files, etc.) without
passing through a controlling process's address space. In
PPIO, processes use the splice system call to create
kernel-maintained associations between producer and
consumer. Splice represents an addition to the
conventional operating system I/O interfaces and is not a
replacement for the read and write system functions.

The Splice Mechanism 

The splice mechanism is a system function used to
establish a kernel-managed data path directly between
I/O device peers. It is the primary mechanism that
processes invoke to use PPIO. With splice, an
application expresses an association between a data source
and sink directly to the operating system through the
use of file descriptors. These descriptors do not refer to
memory addresses (i.e., they are not buffers):

sd = splice(fd1, fd2);

As shown in Figure 5, the call establishes an in-kernel
data path, i.e., a splice, between a data source and sink
device. If the I/O bus and the devices support
hardware streaming, the data path is directly over the bus,
avoiding system memory altogether. Although the
process does not necessarily manipulate the data,
it controls the size and timing of the dataflow. To
manipulate the data, a processing module can be
downloaded either into the kernel or directly on the
devices if they support processing.

The data source and sink device are specified by the
references fd1 and fd2, respectively. The splice
descriptor sd is used in subsequent calls to read or write to
control the flow of data across the splice. For example,
the following call causes size bytes of data to flow from
the source to the sink:

splice_ctrl_msg sc;

sc.op = SPLICE_OP_STARTFLOW;
sc.increment = size;
write(sd, &sc, sizeof(sc));

Data produced by the device referenced by fd1 is
automatically routed to fd2 without user process
intervention, until size bytes have been produced at the source.
The increment field specifies the number of bytes to
transfer across a splice before returning control to the
calling user application. When control is returned,
dataflow is stopped. A SPLICE_OP_STARTFLOW
must be executed to restart dataflow. The increment
represents an important concept in PPIO and refers to
the amount of data the user process is willing to have
transferred by the operating system on its behalf.
In effect, it specifies the level of delegation the user
process is willing to give to the system. Specifying
SPLICE_INCREMENT_DEFAULT indicates the
system should choose an appropriate increment. This is
generally a buffer size deemed convenient by the
operating system.



The splice mechanism eliminates copy operations to
user space by not relying on buffer interfaces such as
those present in the conventional I/O functions read
and write. Eliminating user-level buffering makes
kernel buffer sharing possible. More specifically, when
block alignment is not required by an I/O device, a
kernel-level buffer used for data input may be used
subsequently for data output.
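
For contrast, the conventional relay that splice is meant to replace
looks roughly like the following sketch, written against the
standard POSIX read and write calls. The function name
relay_with_read_write and the 4096-byte buffer are illustrative
choices. Every byte is copied from the kernel into the user buffer
and back again, and each call crosses the user/kernel boundary.

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>
#include <unistd.h>

/* Relay up to `count` bytes from src_fd to dst_fd through a user-space
 * buffer, as conventional UNIX I/O requires. Returns the number of
 * bytes relayed, or -1 on error. */
ssize_t relay_with_read_write(int src_fd, int dst_fd, size_t count)
{
    char buf[4096];            /* the user-level buffer that splice removes */
    size_t total = 0;

    assert(src_fd >= 0 && dst_fd >= 0);
    while (total < count) {
        size_t want = count - total;
        if (want > sizeof buf)
            want = sizeof buf;

        ssize_t got = read(src_fd, buf, want);      /* copy: kernel to user */
        if (got < 0) {
            if (errno == EINTR)
                continue;
            return -1;
        }
        if (got == 0)
            break;                                  /* end of input */

        ssize_t off = 0;
        while (off < got) {
            /* copy: user to kernel */
            ssize_t put = write(dst_fd, buf + off, (size_t)(got - off));
            if (put < 0) {
                if (errno == EINTR)
                    continue;
                return -1;
            }
            off += put;
        }
        total += (size_t)got;
    }
    return (ssize_t)total;
}
```

With splice, the process would issue only the control message shown
earlier, and the loop body would run inside the kernel without the
intermediate user buffer.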

In addition to removing the buffering interfaces,
splice also combines the read/write functionality
in one call. The splice call indicates to the
operating system the source and sink of a dataflow,
providing sufficient information for the kernel to
manage the data transfer by itself without requiring
user-process execution. Thus, context switch operations
for data transfer are eliminated. This is important:
context switches consume CPU resources, degrade cache
performance by reducing locality of reference, and
affect the performance of virtual memory by requiring
TLB invalidations.

For applications making no direct manipulation of
I/O data (or for those allowing the kernel to make
such manipulations), splice relegates the issues of
managing the dataflow (e.g., buffering and flow control)
to the kernel. Data movement may be accomplished
by a kernel-level thread, possibly activated by
completion events (e.g., device interrupt) or operating in a
more synchronous fashion. Flow control may be
achieved by selective scheduling of kernel threads or
simply by posting reads only to data-producing
devices when data-consuming peers complete I/O
operations. A kernel-level implementation provides
much flexibility in choosing which control abstraction
is most appropriate.

One criticism of streaming-based data transfer
mechanisms is that they inhibit innovation in
application development by disallowing applications direct
access to I/O data. However, many applications that
do not require direct manipulation of I/O data can
benefit from streaming (e.g., data-retrieving servers
that do not need to inspect the data they have been
requested to deliver to a client). Furthermore, for
applications requiring well-known data manipulations,
kernel-resident processing modules (e.g., Ritchie's
Streams) or outboard dedicated processors are more
easily exploited within the kernel operating
environment than in user processes. In fact, PPIO supports
processing modules.

PPIO Implementation and Performance 

The PPIO design was conceived to support large data
transfers. The decoupling of I/O data from process
address space reduces cache interference and
eliminates most data copies and process manipulation.
PPIO and the accompanying splice system call have
been implemented within the ULTRIX version 4.2a
operating system for the DEC 5000 series
workstations, and within DEC OSF/1 version 2.0 for DEC
3000 series (Alpha-powered) workstations, each for
a limited number of devices.

Three performance evaluation studies of PPIO
have been carried out and are described in our early
papers. They indicate that CPU availability improves by
30 percent or more, and that throughput and latency
improve by a factor of two to three, depending on
the speed of I/O devices. Generally, the latency and
throughput improvements offered by PPIO grow
with faster I/O devices, indicating that
PPIO scales well with new I/O device technology.

Improving Network Software Throughput 

Network I/O presents a special problem in that the
complexity of the abstraction layer (see Figure 2), a
stack of network protocols, is generally much greater
than that for other types of I/O. In this section, we
summarize the results of an analysis of overheads for
an implementation of TCP/IP we used in the Sequoia
2000 project. The primary bottleneck in achieving
high throughput communication for TCP/IP is due
to data-touching operations: one expected culprit is
data copying (from kernel to user space, and vice
versa); another is the checksum computation. Since we
have already focused on how to avoid data copying in
the previous two sections, we discuss how one can
safely avoid computing checksums for a common case
in network communication.

Overhead Analysis 

We undertook a study to determine what bottlenecks
might exist in TCP/IP implementations to direct us in
our goal of optimizing throughput. The full study is
described elsewhere.

First, we categorized various generic functions com- 
monly executed by TCP/IP (and UDP/IP) protocol 
stacks: 

■ Checksum: the checksum computation for UDP
(user datagram protocol) and TCP

■ DataMove: any operations that involve moving
data from one memory location to another

■ Mbuf: the message-buffering scheme used by
Berkeley UNIX-based network subsystems

■ ProtSpec: all protocol-specific operations, such as
setting header fields and maintaining protocol state

■ DataStruct: the manipulation of various data
structures other than mbufs or those accounted for in
the ProtSpec category

■ OpSys: operating system overhead

■ ErrorChk: the category of checks for user and
system errors, such as parameter checking on socket
system calls

■ Other: this final category of overhead includes all
the operations that are too small to measure. Its
time was computed by taking the difference
between the total processing time and the sum of
the times of all the other categories listed above.
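
The residual computation for the Other category can be written
down directly. In this sketch the enum names mirror the categories
in the list above, while the function name other_time and the
array layout are illustrative assumptions.

```c
#include <assert.h>
#include <stddef.h>

/* Measured categories, named after the list above. */
enum overhead_category {
    CHECKSUM, DATA_MOVE, MBUF, PROT_SPEC, DATA_STRUCT, OP_SYS, ERROR_CHK,
    NUM_MEASURED
};

/* "Other" is the residual: total processing time minus the sum of the
 * measured categories. Times are in whatever unit the measurement used. */
double other_time(double total, const double measured[NUM_MEASURED])
{
    double sum = 0.0;
    for (size_t i = 0; i < NUM_MEASURED; i++)
        sum += measured[i];
    assert(total >= sum);   /* the parts cannot exceed the whole */
    return total - sum;
}
```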

Other studies have shown some of these overheads
to be expensive.

We measured the total amount of execution time
spent in the TCP/IP and UDP/IP protocol stacks as
implemented in the DEC ULTRIX version 4.2a kernel,
to send and receive IP packets of a wide range of sizes,
broken down according to the categories listed above.
All measurements were taken using a logic analyzer
attached to a DECstation 5000/200 workstation
connected to another similar workstation by an FDDI LAN
attached through a Digital DEFZA FDDI adapter.

Figure 6 shows the per-packet processing times
versus packet size for the various overheads for UDP
packets. These are for a large range of packet sizes,
from 1 to 8,192 bytes. One can distinguish two
different types of overheads: those due to data-touching
operations (i.e., data move and checksum) and those
due to non-data-touching operations (all other
categories). Data-touching overheads dominate the
processing time for large packets that typically contain
application data, whereas non-data-touching
operations dominate the processing time for small packets
that typically contain control information. Generally,
data-touching overhead times scale linearly with
packet size, whereas non-data-touching overhead
times are comparatively constant. Thus, data-touching
overheads present the major limitations to achieving
maximum throughput.
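
The last step of this argument can be spelled out with a simple
model. The symbols are assumptions introduced here for
illustration: n is the packet size in bytes, T_0 is the roughly
constant non-data-touching time per packet, and c is the
per-byte cost of the data-touching operations.

```latex
% Per-packet processing time: constant part plus linear part
T(n) \approx T_0 + c\,n

% Resulting processing throughput in bytes per unit time
R(n) = \frac{n}{T(n)} \approx \frac{n}{T_0 + c\,n}
\;\longrightarrow\; \frac{1}{c} \quad \text{as } n \to \infty
```

Under this model, T_0 matters only for small packets; for the
large packets that achieve maximum throughput, the ceiling is set
by c alone, that is, by the data-touching operations.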

Data-touching operations, which do identical work
in the TCP and UDP software, also dominate
processing times for large TCP packets.

Minimizing the Checksum Overhead 

As can be seen in Figure 6, the largest bottleneck to
achieving maximum throughput (which one
achieves by sending large packets) is the checksum
computation. We applied two optimizations to
minimize this overhead: improving the implementation of
the checksum computation, and avoiding the
checksum altogether in a special but common case where we
felt we were not compromising data integrity.

We improved the checksum computation
implementation by applying some fairly standard
techniques: operating on 32-bit rather than 16-bit words,
loop unrolling, and reordering of instructions to
maximize pipelining. With these modifications, we
reduced the checksum computation time by more
than a factor of two. Figure 7 shows that the overall
throughput improvement is 37 percent. The
throughput measurements were made between two
DECstation 5000/200 systems communicating over
an FDDI network. Overall throughput is still a
fraction of the maximum FDDI network bandwidth
(100 Mb/s) because of data-copying overheads and
machine-speed limitations. See Reference 6 for
detailed results.
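
The first two techniques can be illustrated with the standard
Internet checksum (the 16-bit one's-complement sum). This is a
sketch, not the code that was measured: the function names are
illustrative, and bytes are assembled explicitly so that the
example does not depend on host byte order or alignment, whereas a
tuned kernel routine would load aligned words directly and order
its instructions for the target pipeline.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Fold a wide one's-complement accumulator down to 16 bits. */
static uint16_t fold16(uint64_t sum)
{
    while (sum >> 16)
        sum = (sum & 0xffffu) + (sum >> 16);
    return (uint16_t)sum;
}

/* Reference version: add one 16-bit big-endian word at a time. */
uint16_t inet_checksum16(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    assert(p != NULL || len == 0);
    while (len >= 2) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;   /* odd trailing byte, zero padded */
    return (uint16_t)~fold16(sum);
}

/* Assemble a 32-bit big-endian word from four bytes. */
static uint32_t load_be32(const uint8_t *p)
{
    return ((uint32_t)p[0] << 24) | ((uint32_t)p[1] << 16) |
           ((uint32_t)p[2] << 8) | (uint32_t)p[3];
}

/* Wider version: accumulate 32-bit words, two per loop pass (unrolled),
 * into a 64-bit sum, and fold only once at the end. */
uint16_t inet_checksum32(const uint8_t *p, size_t len)
{
    uint64_t sum = 0;
    assert(p != NULL || len == 0);
    while (len >= 8) {
        sum += load_be32(p);
        sum += load_be32(p + 4);
        p += 8;
        len -= 8;
    }
    if (len >= 4) {
        sum += load_be32(p);
        p += 4;
        len -= 4;
    }
    if (len >= 2) {
        sum += ((uint32_t)p[0] << 8) | p[1];
        p += 2;
        len -= 2;
    }
    if (len == 1)
        sum += (uint32_t)p[0] << 8;
    return (uint16_t)~fold16(sum);
}
```

Both functions return the same value for any input because 2^16 is
congruent to 1 modulo 2^16 - 1, so carries out of any 16-bit
position can be folded back in at the end. The wider version
simply performs fewer additions per byte.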

A very easy way of significantly raising TCP and
UDP throughput is to simply avoid computing
checksums; in fact, many systems provide options to do just
this. The Internet checksum, however, exists for a
good reason: packets are occasionally corrupted
during transmission, and the checksum is needed to
detect corrupted data. In fact, the Internet
Engineering Task Force (IETF) recommends that systems not
be shipped with checksumming disabled by default.

Ethernet and FDDI networks, however, implement
their own cyclic redundancy checksum (CRC). Thus,
packets sent directly over an Ethernet or FDDI
network are already protected from data corruption, at
least at the level provided by the CRC. One can argue
that for LAN communication, the Internet checksum
computation does not significantly add to the
machinery for error detection already provided in hardware.

Thus, our second optimization was simply to
eliminate the software checksum computation altogether
when computing the checksum would make little
difference. Consequently, as part of the
implementation of the protocol, when the source and
destination are determined to be on the same LAN, the
software checksum computation is avoided. Figure 7
shows the resulting 74 percent improvement in
throughput over the unmodified ULTRIX version
4.2a operating system, and a 27 percent improvement
over the implementation with the optimized
checksum computation algorithm.
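
How the same-LAN determination is made is not spelled out here.
One plausible way, shown purely as an assumption, is to compare
the IPv4 network portions of the two addresses and to require that
the attached link provides its own CRC. The function names and
the link_has_crc flag are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True when both IPv4 addresses fall in the same subnet.
 * Addresses and mask are in host byte order. */
bool same_subnet(uint32_t src, uint32_t dst, uint32_t netmask)
{
    return ((src ^ dst) & netmask) == 0;
}

/* Decide whether the software checksum must be computed for a packet.
 * `link_has_crc` says whether the attached network (for example,
 * Ethernet or FDDI) protects frames with its own CRC. */
bool need_software_checksum(uint32_t src, uint32_t dst,
                            uint32_t netmask, bool link_has_crc)
{
    assert(netmask != 0);   /* a zero mask would treat every host as local */
    return !(link_has_crc && same_subnet(src, dst, netmask));
}
```

The check is conservative: the software checksum is skipped only
when both conditions hold, and any packet that leaves the local
subnet is still checksummed.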

Of course, one must be very careful about deciding
when the Internet checksum is of minimal value. We
believe it is reasonable to turn off checksums when
crossing a single network that implements its own
CRC, especially when one considers the performance
benefits of doing so. In addition, since the destinations
of most TCP and UDP packets are within the same
LAN on which they are sent, this policy eliminates the
software checksum computation for most packets.

Our checksum elimination policy differs somewhat
from traditional TCP/IP design in one aspect of
protection against corruption. In addition to the
protection between network interfaces given by the Ethernet
and FDDI checksums, the traditional design relies on a
software checksum computed in host memory as a
protection from errors in data transfer over the I/O bus.
For common devices such as disks, however, data
transfers over the I/O bus are routinely assumed to be
correct and are not checked in software. Therefore, a
reduction in protection against I/O bus transfer errors
for network devices does not seem unreasonable.

Turning off the Internet checksum protection in
any wider area context seems unwise without
considerable justification. Not all networks are protected by
CRCs, and it is difficult to see how one might check
that an entire routed path is protected by CRCs
without undue complications involving IP extensions.
A more fundamental problem is that network CRCs
protect a packet only between network interfaces;
errors may arise while a packet is in a gateway machine.
Although such corruption is unlikely for a single
machine, the chance of data corruption occurring
grows with the number of gateways
a packet crosses.
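
The effect of path length can be quantified under a simplifying
assumption that is not made in the text: each gateway corrupts a
packet independently with the same probability p. The chance of
arriving intact is then (1 - p) raised to the number of gateways,
which shrinks geometrically. The function name below is
illustrative.

```c
#include <assert.h>

/* Probability that a packet is corrupted in at least one of `gateways`
 * machines, assuming each corrupts it independently with probability p. */
double path_corruption_probability(double p, unsigned gateways)
{
    assert(p >= 0.0 && p <= 1.0);
    double survive = 1.0;
    for (unsigned i = 0; i < gateways; i++)
        survive *= 1.0 - p;      /* packet must pass every gateway intact */
    return 1.0 - survive;
}
```

For small p the result is close to p times the number of gateways,
so a risk that is negligible on one machine becomes noticeable on a
long routed path.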

Summary and Conclusions 

We described various solutions to achieving high
performance in operating system I/O and network
software, with a particular emphasis on throughput. Two of
the solutions, container shipping and peer-to-peer I/O,
focused on changes in the I/O system software
structure to avoid data copying and other overheads. The
third solution focused on the avoidance of additional
data-touching overheads in TCP/IP network software.

Container shipping is a kernel service that provides
I/O operations for user processes. High performance
is obtained by eliminating the in-memory data copies
traditionally associated with I/O, without sacrificing
safety or relying on devices with special-purpose
functionality. Further gains are achieved by permitting the
selective accessing (mapping) of data. We measured
performance improvements over UNIX of 40 percent
(network I/O) to 700 percent (socket IPC).

PPIO is based on the hypothesis that the
memory-oriented model of I/O present in most operating
systems presents a bottleneck that adversely affects overall
performance. PPIO decouples user-process execution
from interdevice dataflow and can achieve
improvements in both latency and throughput over
conventional systems by a factor of 2 to 3.

Finally, we considered the special case of network
I/O where data moving/copying is not the only major
overhead. We showed that the checksum computation
is a major source of TCP/IP network processing
overhead. We improved performance by optimizing the
checksum computation algorithm and eliminating
the checksum computation when communicating over
a single LAN that supports its own CRC, improving
throughput by 37 percent to 74 percent for UDP/IP.